CS236781: Deep Learning on Computational Accelerators¶

Homework Assignment 2¶

Faculty of Computer Science, Technion.

Submitted by:

# Name Id email
Student 1 Boaz Maron 215386103 boaz.maron@campus.technion.ac.il
Student 2 Michael Beygel 213764251 mbeygel@campus.technion.ac.il

Introduction¶

In this assignment we'll create a from-scratch implementation of two fundemental deep learning concepts: the backpropagation algorithm and stochastic gradient descent-based optimizers. In addition, you will create a general-purpose multilayer perceptron, the core building block of deep neural networks. We'll visualize decision bounrdaries and ROC curves in the context of binary classification. Following that we will focus on convolutional networks with residual blocks. We'll create our own network architectures and train them using GPUs on the course servers, then we'll conduct architecture experiments to determine the the effects of different architectural decisions on the performance of deep networks.

General Guidelines¶

  • Please read the getting started page on the course website. It explains how to setup, run and submit the assignment.
  • Please read the course servers usage guide. It explains how to use and run your code on the course servers to benefit from training with GPUs.
  • The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited at all (unless you wish to play around). The only exception is to fill your name(s) in the above cell before submission. Please do not remove sections or change the order of any cells.
  • All your code (and even answers to questions) should be written in the files within the python package corresponding the assignment number (hw1, hw2, etc). You can of course use any editor or IDE to work on these files.

Contents¶

  • Part 1: Backpropagation
  • Part 2: Optimization and Training:
  • Part 3: Binary Classification with Multilayer Perceptrons
  • Part 4: Convolutional Neural Networks:
  • Part 5: Convolutional Architecture Experiments
  • Part 6: YOLO - Object Detection

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 1: Backpropagation¶

In this part, we'll implement backpropagation and automatic differentiation from scratch and compare our implementations to PyTorch's built in implementation (autograd).

In [1]:
import torch
import unittest

%load_ext autoreload
%autoreload 2

test = unittest.TestCase()

Reminder: The backpropagation algorithm is at the core of training deep models. To state the problem we'll tackle in this notebook, imagine we have an L-layer MLP model, defined as $$ \hat{\vec{y}^i} = \vec{y}_L^i= \varphi_L \left( \mat{W}_L \varphi_{L-1} \left( \cdots \varphi_1 \left( \mat{W}_1 \vec{x}^i + \vec{b}_1 \right) \cdots \right) + \vec{b}_L \right), $$

a pointwise loss function $\ell(\vec{y}, \hat{\vec{y}})$ and an empirical loss over our entire data set, $$ L(\vec{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\vec{y}^i, \hat{\vec{y}^i}) + R(\vec{\theta}) $$

where $\vec{\theta}$ is a vector containing all network parameters, e.g. $\vec{\theta} = \left[ \mat{W}_{1,:}, \vec{b}_1, \dots, \mat{W}_{L,:}, \vec{b}_L \right]$.

In order to train our model we would like to calculate the derivative (or gradient, in the multivariate case) of the loss with respect to each and every one of the parameters, i.e. $\pderiv{L}{\mat{W}_j}$ and $\pderiv{L}{\vec{b}_j}$ for all $j$. Since the gradient "points" to the direction of functional increase, the negative gradient is often used as a descent direction for descent-based optimization algorithms. In other words, iteratively updating each parameter proportianally to it's negetive gradient can lead to convergence to a local minimum of the loss function.

Calculus tells us that as long as we know the derivatives of all the functions "along the way" ($\varphi_i(\cdot),\ \ell(\cdot,\cdot),\ R(\cdot)$) we can use the chain rule to calculate the derivative of the loss with respect to any one of the parameter vectors. Note that if the loss $L(\vec{\theta})$ is scalar (which is usually the case), the gradient of a parameter will have the same shape as the parameter itself (matrix/vector/tensor of same dimensions).

For deep models that are a composition of many functions, calculating the gradient of each parameter by hand and implementing hard-coded gradient derivations quickly becomes infeasible. Additionally, such code makes models hard to change, since any change potentially requires re-derivation and re-implementation of the entire gradient function.

The backpropagation algorithm, which we saw in the lecture, provides us with a effective method of applying the chain rule recursively so that we can implement gradient calculations of arbitrarily deep or complex models.

We'll now implement backpropagation using a modular approach, which will allow us to chain many components layers together and get automatic gradient calculation of the output with respect to the input or any intermediate parameter.

To do this, we'll define a Layer class. Here's the API of this class:

In [2]:
import hw2.layers as layers
help(layers.Layer)
Help on class Layer in module hw2.layers:

class Layer(abc.ABC)
 |  A Layer is some computation element in a network architecture which
 |  supports automatic differentiation using forward and backward functions.
 |  
 |  Method resolution order:
 |      Layer
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __call__(self, *args, **kwargs)
 |      Call self as a function.
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  backward(self, dout)
 |      Computes the backward pass of the layer, i.e. the gradient
 |      calculation of the final network output with respect to each of the
 |      parameters of the forward function.
 |      :param dout: The gradient of the network with respect to the
 |      output of this layer.
 |      :return: A tuple with the same number of elements as the parameters of
 |      the forward function. Each element will be the gradient of the
 |      network output with respect to that parameter.
 |  
 |  forward(self, *args, **kwargs)
 |      Computes the forward pass of the layer.
 |      :param args: The computation arguments (implementation specific).
 |      :return: The result of the computation.
 |  
 |  params(self)
 |      :return: Layer's trainable parameters and their gradients as a list
 |      of tuples, each tuple containing a tensor and it's corresponding
 |      gradient tensor.
 |  
 |  train(self, training_mode=True)
 |      Changes the mode of this layer between training and evaluation (test)
 |      mode. Some layers have different behaviour depending on mode.
 |      :param training_mode: True: set the model in training mode. False: set
 |      evaluation mode.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset({'backward', 'forward', 'params'})

In other words, a Layer can be anything: a layer, an activation function, a loss function or generally any computation that we know how to derive a gradient for.

Each Layer must define a forward() function and a backward() function.

  • The forward() function performs the actual calculation/operation of the block and returns an output.
  • The backward() function computes the gradient of the input and parameters as a function of the gradient of the output, according to the chain rule.

Here's a diagram illustrating the above explanation:

No description has been provided for this image

Note that the diagram doesn't show that if the function is parametrized, i.e. $f(\vec{x},\vec{y})=f(\vec{x},\vec{y};\vec{w})$, there are also gradients to calculate for the parameters $\vec{w}$.

The forward pass is straightforward: just do the computation. To understand the backward pass, imagine that there's some "downstream" loss function $L(\vec{\theta})$ and magically somehow we are told the gradient of that loss with respect to the output $\vec{z}$ of our block, i.e. $\pderiv{L}{\vec{z}}$.

Now, since we know how to calculate the derivative of $f(\vec{x},\vec{y};\vec{w})$, it means we know how to calculate $\pderiv{\vec{z}}{\vec{x}}$, $\pderiv{\vec{z}}{\vec{y}}$ and $\pderiv{\vec{z}}{\vec{w}}$ . Thanks to the chain rule, this is all we need to calculate the gradients of the loss w.r.t. the input and parameters:

$$ \begin{align} \pderiv{L}{\vec{x}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{x}}\\ \pderiv{L}{\vec{y}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{y}}\\ \pderiv{L}{\vec{w}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}} \end{align} $$

Comparison with PyTorch¶

PyTorch has the nn.Module base class, which may seem to be similar to our Layer since it also represents a computation element in a network. However PyTorch's nn.Modules don't compute the gradient directly, they only define the forward calculations. Instead, PyTorch has a more low-level API for defining a function and explicitly implementing it's forward() and backward(). See autograd.Function. When an operation is performed on a tensor, it creates a Function instance which performs the operation and stores any necessary information for calculating the gradient later on. Additionally, Functionss point to the other Function objects representing the operations performed earlier on the tensor. Thus, a graph (or DAG) of operations is created (this is not 100% exact, as the graph is actually composed of a different type of class which wraps the backward method, but it's accurate enough for our purposes).

A Tensor instance which was created by performing operations on one or more tensors with requires_grad=True, has a grad_fn property which is a Function instance representing the last operation performed to produce this tensor. This exposes the graph of Function instances, each with it's own backward() function. Therefore, in PyTorch the backward() function is called on the tensors, not the modules.

Our Layers are therefore a combination of the ideas in Module and Function and we'll implement them together, just to make things simpler. Our goal here is to create a "poor man's autograd": We'll use PyTorch tensors, but we'll calculate and store the gradients in our Layers (or return them). The gradients we'll calculate are of the entire block, not individual operations on tensors.

To test our implementation, we'll use PyTorch's autograd.

Note that of course this method of tracking gradients is much more limited than what PyTorch offers. However it allows us to implement the backpropagation algorithm very simply and really see how it works.

Let's set up some testing instrumentation:

In [3]:
from hw2.grad_compare import compare_layer_to_torch

def test_block_grad(block: layers.Layer, x, y=None, delta=1e-3):
    diffs = compare_layer_to_torch(block, x, y)
    
    # Assert diff values
    for diff in diffs:
        test.assertLess(diff, delta)

# Show the compare function
compare_layer_to_torch??

Notes:

  • After you complete your implementation, you should make sure to read and understand the compare_layer_to_torch() function. It will help you understand what PyTorch is doing.
  • The value of delta above should not be needed. A correct implementation will give you a diff of exactly zero.

Layer Implementations¶

We'll now implement some Layers that will enable us to later build an MLP model of arbitrary depth, complete with automatic differentiation.

For each block, you'll first implement the forward() function. Then, you will calculate the derivative of the block by hand with respect to each of its input tensors and each of its parameter tensors (if any). Using your manually-calculated derivation, you can then implement the backward() function.

Notice that we have intermediate Jacobians that are potentially high dimensional tensors. For example in the expression $\pderiv{L}{\vec{w}} = \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$, the term $\pderiv{\vec{z}}{\vec{w}}$ is a 4D Jacobian if both $\vec{z}$ and $\vec{w}$ are 2D matrices.

In order to implement the backpropagation algorithm efficiently, we need to implement every backward function without explicitly constructing this Jacobian. Instead, we're interested in directly calculating the vector-Jacobian product (VJP) $\pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$. In order to do this, you should try to figure out the gradient of the loss with respect to one element, e.g. $\pderiv{L}{\vec{w}_{1,1}}$ and extrapolate from there how to directly obtain the VJP.

Activation functions¶

(Leaky) ReLU¶

ReLU, or rectified linear unit is a very common activation function in deep learning architectures. In it's most standard form, as we'll implement here, it has no parameters.

We'll first implement the "leaky" version, defined as

$$ \mathrm{relu}(\vec{x}) = \max(\alpha\vec{x},\vec{x}), \ 0\leq\alpha<1 $$

This is similar to the ReLU activation we've seen in class, only that it has a small non-zero slope then it's input is negative. Note that it's not strictly differentiable, however it has sub-gradients, defined separately any positive-valued input and for negative-valued input.

TODO: Complete the implementation of the LeakyReLU class in the hw2/layers.py module.

In [4]:
N = 100
in_features = 200
num_classes = 10
eps = 1e-6
In [5]:
# Test LeakyReLU
alpha = 0.1
lrelu = layers.LeakyReLU(alpha=alpha)
x_test = torch.randn(N, in_features)

# Test forward pass
z = lrelu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.nn.LeakyReLU(alpha)(x_test), atol=eps))

# Test backward pass
test_block_grad(lrelu, x_test)
Comparing gradients... 
input    diff=0.000

Now using the LeakyReLU, we can trivially define a regular ReLU block as a special case.

TODO: Complete the implementation of the ReLU class in the hw2/layers.py module.

In [6]:
# Test ReLU
relu = layers.ReLU()
x_test = torch.randn(N, in_features)

# Test forward pass
z = relu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.relu(x_test), atol=eps))

# Test backward pass
test_block_grad(relu, x_test)
Comparing gradients... 
input    diff=0.000

Sigmoid¶

The sigmoid function $\sigma(x)$ is also sometimes used as an activation function. We have also seen it previously in the context of logistic regression.

The sigmoid function is defined as

$$ \sigma(\vec{x}) = \frac{1}{1+\mathbf{e}^{-\vec{x}}}. $$

In [7]:
# Test Sigmoid
sigmoid = layers.Sigmoid()
x_test = torch.randn(N, in_features, in_features) # 3D input should work

# Test forward pass
z = sigmoid(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.sigmoid(x_test), atol=eps))

# Test backward pass
test_block_grad(sigmoid, x_test)
Comparing gradients... 
input    diff=0.000

Hyperbolic Tangent¶

The hyperbolic tangent function $\tanh(x)$ is a common activation function used when the output should be in the range [-1, 1].

The tanh function is defined as

$$ \tanh(\vec{x}) = \frac{\mathbf{e}^x-\mathbf{e}^{-x}}{\mathbf{e}^x+\mathbf{e}^{-\vec{x}}}. $$

In [8]:
# Test TanH
tanh = layers.TanH()
x_test = torch.randn(N, in_features, in_features) # 3D input should work

# Test forward pass
z = tanh(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.tanh(x_test), atol=eps))

# Test backward pass
test_block_grad(tanh, x_test)
Comparing gradients... 
input    diff=0.000

Linear (fully connected) layer¶

First, we'll implement an affine transform layer, also known as a fully connected layer.

Given an input $\mat{X}$ the layer computes,

$$ \mat{Z} = \mat{X} \mattr{W} + \vec{b} ,~ \mat{X}\in\set{R}^{N\times D_{\mathrm{in}}},~ \mat{W}\in\set{R}^{D_{\mathrm{out}}\times D_{\mathrm{in}}},~ \vec{b}\in\set{R}^{D_{\mathrm{out}}}. $$

Notes:

  • We write it this way to follow the implementation conventions.
  • $N$ is the number of samples in the input (batch size). The input $\mat{X}$ will always be a tensor containing a batch dimension first.
  • Thanks to broadcasting, $\vec{b}$ can remain a vector even though the input $\mat{X}$ is a matrix.

TODO: Complete the implementation of the Linear class in the hw2/layers.py module.

In [9]:
# Test Linear
out_features = 1000
fc = layers.Linear(in_features, out_features)
x_test = torch.randn(N, in_features)

# Test forward pass
z = fc(x_test)
test.assertSequenceEqual(z.shape, [N, out_features])
torch_fc = torch.nn.Linear(in_features, out_features,bias=True)
torch_fc.weight = torch.nn.Parameter(fc.w)
torch_fc.bias = torch.nn.Parameter(fc.b)
test.assertTrue(torch.allclose(torch_fc(x_test), z, atol=eps))

# Test backward pass
test_block_grad(fc, x_test)

# Test second backward pass
x_test = torch.randn(N, in_features)
z = fc(x_test)
z = fc(x_test)
test_block_grad(fc, x_test)
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000

Cross-Entropy Loss¶

As you know by know, cross-entropy is a common loss function for classification tasks. In class, we defined it as

$$\ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) = - {\vectr{y}} \log(\hat{\vec{y}})$$

where $\hat{\vec{y}} = \mathrm{softmax}(x)$ is a probability vector (the output of softmax on the class scores $\vec{x}$) and the vector $\vec{y}$ is a 1-hot encoded class label.

However, it's tricky to compute the gradient of softmax, so instead we'll define a version of cross-entropy that produces the exact same output but works directly on the class scores $\vec{x}$.

We can write, $$\begin{align} \ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) &= - {\vectr{y}} \log(\hat{\vec{y}}) = - {\vectr{y}} \log\left(\mathrm{softmax}(\vec{x})\right) \\ &= - {\vectr{y}} \log\left(\frac{e^{\vec{x}}}{\sum_k e^{x_k}}\right) \\ &= - \log\left(\frac{e^{x_y}}{\sum_k e^{x_k}}\right) \\ &= - \left(\log\left(e^{x_y}\right) - \log\left(\sum_k e^{x_k}\right)\right)\\ &= - x_y + \log\left(\sum_k e^{x_k}\right) \end{align}$$

Where the scalar $y$ is the correct class label, so $x_y$ is the correct class score.

Note that this version of cross entropy is also what's provided by PyTorch's nn module.

TODO: Complete the implementation of the CrossEntropyLoss class in the hw2/layers.py module.

In [10]:
# Test CrossEntropy
cross_entropy = layers.CrossEntropyLoss()
scores = torch.randn(N, num_classes)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)

# Test forward pass
loss = cross_entropy(scores, labels)
expected_loss = torch.nn.functional.cross_entropy(scores, labels)
test.assertLess(torch.abs(expected_loss-loss).item(), 1e-5)
print('loss=', loss.item())

# Test backward pass
test_block_grad(cross_entropy, scores, y=labels)
loss= 2.7283618450164795
Comparing gradients... 
input    diff=0.000

Building Models¶

Now that we have some working Layers, we can build an MLP model of arbitrary depth and compute end-to-end gradients.

First, lets copy an idea from PyTorch and implement our own version of the nn.Sequential Module. This is a Layer which contains other Layers and calls them in sequence. We'll use this to build our MLP model.

TODO: Complete the implementation of the Sequential class in the hw2/layers.py module.

In [11]:
# Test Sequential
# Let's create a long sequence of layers and see
# whether we can compute end-to-end gradients of the whole thing.

seq = layers.Sequential(
    layers.Linear(in_features, 100),
    layers.Linear(100, 200),
    layers.Linear(200, 100),
    layers.ReLU(),
    layers.Linear(100, 500),
    layers.LeakyReLU(alpha=0.01),
    layers.Linear(500, 200),
    layers.ReLU(),
    layers.Linear(200, 500),
    layers.LeakyReLU(alpha=0.1),
    layers.Linear(500, 1),
    layers.Sigmoid(),
)
x_test = torch.randn(N, in_features)

# Test forward pass
z = seq(x_test)
test.assertSequenceEqual(z.shape, [N, 1])

# Test backward pass
test_block_grad(seq, x_test)
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000
param#03 diff=0.000
param#04 diff=0.000
param#05 diff=0.000
param#06 diff=0.000
param#07 diff=0.000
param#08 diff=0.000
param#09 diff=0.000
param#10 diff=0.000
param#11 diff=0.000
param#12 diff=0.000
param#13 diff=0.000
param#14 diff=0.000

Now, equipped with a Sequential, all we have to do is create an MLP architecture. We'll define our MLP with the following hyperparameters:

  • Number of input features, $D$.
  • Number of output classes, $C$.
  • Sizes of hidden layers, $h_1,\dots,h_L$.

So the architecture will be:

FC($D$, $h_1$) $\rightarrow$ ReLU $\rightarrow$ FC($h_1$, $h_2$) $\rightarrow$ ReLU $\rightarrow$ $\cdots$ $\rightarrow$ FC($h_{L-1}$, $h_L$) $\rightarrow$ ReLU $\rightarrow$ FC($h_{L}$, $C$)

We'll also create a sequence of the above MLP and a cross-entropy loss, since it's the gradient of the loss that we need in order to train a model.

TODO: Complete the implementation of the MLP class in the hw2/layers.py module. Ignore the dropout parameter for now.

In [12]:
# Create an MLP model
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100])
print(mlp)
MLP, Sequential
	[0] Linear(self.in_features=200, self.out_features=100)
	[1] ReLU
	[2] Linear(self.in_features=100, self.out_features=50)
	[3] ReLU
	[4] Linear(self.in_features=50, self.out_features=100)
	[5] ReLU
	[6] Linear(self.in_features=100, self.out_features=10)

In [13]:
# Test MLP architecture
N = 100
in_features = 10
num_classes = 10
for activation in ('relu', 'sigmoid'):
    mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100], activation=activation)
    test.assertEqual(len(mlp.sequence), 7)
    
    num_linear = 0
    for b1, b2 in zip(mlp.sequence, mlp.sequence[1:]):
        if (str(b2).lower() == activation):
            test.assertTrue(str(b1).startswith('Linear'))
            num_linear += 1
            
    test.assertTrue(str(mlp.sequence[-1]).startswith('Linear'))
    test.assertEqual(num_linear, 3)

    # Test MLP gradients
    # Test forward pass
    x_test = torch.randn(N, in_features)
    labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
    z = mlp(x_test)
    test.assertSequenceEqual(z.shape, [N, num_classes])

    # Create a sequence of MLPs and CE loss
    seq_mlp = layers.Sequential(mlp, layers.CrossEntropyLoss())
    loss = seq_mlp(x_test, y=labels)
    test.assertEqual(loss.dim(), 0)
    print(f'MLP loss={loss}, activation={activation}')

    # Test backward pass
    test_block_grad(seq_mlp, x_test, y=labels)
MLP loss=2.30924391746521, activation=relu
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000
param#03 diff=0.000
param#04 diff=0.000
param#05 diff=0.000
param#06 diff=0.000
param#07 diff=0.000
param#08 diff=0.000
MLP loss=2.3934404850006104, activation=sigmoid
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000
param#03 diff=0.000
param#04 diff=0.000
param#05 diff=0.000
param#06 diff=0.000
param#07 diff=0.000
param#08 diff=0.000

If the above tests passed then congratulations - you've now implemented an arbitrarily deep model and loss function with end-to-end automatic differentiation!

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.

In [14]:
from cs236781.answers import display_answer
import hw2.answers

Question 1¶

Suppose we have a linear (i.e. fully-connected) layer with a weight tensor $\mat{W}$, defined with in_features=1024 and out_features=512. We apply this layer to an input tensor $\mat{X}$ containing a batch of N=64 samples. The output of the layer is denoted as $\mat{Y}$.

  1. Consider the Jacobian tensor $\pderiv{\mat{Y}}{\mat{X}}$ of the output of the layer w.r.t. the input $\mat{X}$.
    1. What is the shape of this tensor?
    2. Considering the above Jacobian as a 2D block matrix with blocks of shape (out_features × in_features), describe its structure. What can you definitively say about it?
    3. Apart from VJP and considering the above analysis, is there an optimization that can be done? If so, what is the new tensor shape? If not, explain why.
    4. Given the gradient of the output w.r.t. some downstream scalar loss $L$, $\delta\mat{Y} := \pderiv{L}{\mat{Y}}$, show how to calculate the downstream gradient w.r.t. the input ($\delta\mat{X}$) without materializing the Jacobian.
    5. Consider the Jacobian tensor $\pderiv{\mat{Y}}{\mat{W}}$ of the output of the layer w.r.t. the layer weights $\mat{W}$. What is the shape of this tensor? What is the shape of the blocks if you were to make it into a block matrix?
In [15]:
display_answer(hw2.answers.part1_q1)

1

1.1 Full Jacobian Shape:

The total number of elements in $\mat{X}$ is $N \times D_{\text{in}} = 64 \times 1024 = 65,536$.

The total number of elements in $\mat{Y}$ is $N \times D_{\text{out}} = 64 \times 512 = 32,768$.

The shape of the full Jacobian $\frac{\partial \mat{Y}}{\partial \mat{X}}$ is:

$$(N \times D_{\text{out}}) \times (N \times D_{\text{in}}) = (32,768) \times (65,536)$$

1.2 Block Matrix Structure:

When viewed as a block matrix with blocks of shape $(D_{\text{out}} \times D_{\text{in}}) = (512 \times 1024)$:

The Jacobian is a diagonal block matrix. The structure is:

  • Diagonal blocks (where $i = j$): Each diagonal block equals $\mat{W}^T$ because sample $i$'s output depends only on sample $i$'s input $\text{s.t. } y_i = x_i W^T$
  • Off-diagonal blocks (where $i \neq j$): All zeros, because the output of sample $i$ is independent of the input of sample $j$

1.3 Optimization:

Because the Jacobian is a diagonal block matrix whose diagonal blocks are all identical (each block equals $\mat{W}^\top$), we do not need to store every block separately. Instead, we can store a single copy of the weight matrix.

  • Optimized storage: one copy of the weight matrix $\mat{W}$ (or $\mat{W}^\top$) with shape $(512) \times (1024)$.

This reduces storage from the full Jacobian of shape $(64 \times 512 \times 64 \times 1024)$ down to $(512\times1024)$, which is $64^2$ times smaller.

1.4 Computing Gradient Without Materializing Jacobian: Given $\delta\mathbf{Y}\in\mathbb{R}^{N\times512}$, compute per-sample $$\delta\mathbf{x}^{(i)} = \delta\mathbf{y}^{(i)}\,\mathbf{W}\quad((1,512)\cdot(512,1024)=(1,1024))$$

Vectorized:

$$ \delta X = \begin{bmatrix} \delta x^{(1)} \\ \delta x^{(2)} \\ \vdots \\ \delta x^{(N)} \end{bmatrix} = \begin{bmatrix} \delta y^{(1)} \\ \delta y^{(2)} \\ \vdots \\ \delta y^{(N)} \end{bmatrix} W = \delta Y\,W $$

Pass the matrix $\delta X$ to the previous layer's backward as its downstream gradient to continue propagation.

1.5 Jacobian w.r.t. Weights: For the Jacobian $\frac{\partial \mat{Y}}{\partial \mat{W}}$:

  • Full Jacobian shape:

$$ (N\cdot D_{\mathrm{out}}) \times (D_{\mathrm{out}}\cdot D_{\mathrm{in}}) = (32{,}768) \times (524{,}288) = 17,179,869,184. $$

  • Block shape (if arranged as blocks):

$$ (D_{\mathrm{out}}\times D_{\mathrm{in}}) = (512\times 1024). $$

Explanation: For a single sample $i$ and output unit $p$ we have $y_{i,p}=\sum_r W_{p,r}x_{i,r}$, so differentiating w.r.t. the $p$-th row of $W$ yields the input row $x^{(i)}$. Thus each output's derivative places the vector $x^{(i)}$ in the corresponding output-row, producing blocks of size $D_{\mathrm{out}}\times D_{\mathrm{in}}$.

Question 2¶

Can the second order derivative be helpful when optimizing using gradient descent? explain why not or describe a scenario where it's useful.

In [16]:
display_answer(hw2.answers.part1_q2)

Second-Order Derivatives in Gradient Descent

The second-order derivative (Hessian) can be helpful in optimization, but it depends on the context:

Why NOT always helpful:

  1. Computational Cost: Computing the full Hessian matrix is extremely expensive. For a network with $n$ parameters, the Hessian is $n \times n$, requiring $\mathcal{O}(n^2)$ memory and $\mathcal{O}(n^3)$ time for eigendecomposition.

  2. Storage: For modern deep networks with millions of parameters, storing the full Hessian is infeasible.

When Second-Order Derivatives ARE Helpful:

Newton's Method: Uses the inverse Hessian to determine step direction: $$\vec{\theta}_{t+1} = \vec{\theta}_t - \mathcal{H}^{-1} \nabla L$$ Mult by $\mathcal{H}^{-1}$ automatically adjusts the step size, taking larger steps in flat regions (low curvature) and smaller steps in steep regions (high curvature), based on the function's second derivative.

In [ ]:
 
In [ ]:
 

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 2: Optimization and Training¶

In this part we will learn how to implement optimization algorithms for deep networks. Additionally, we'll learn how to write training loops and implement a modular model trainer. We'll use our optimizers and training code to test a few configurations for classifying images with an MLP model.

In [1]:
import os
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf

%matplotlib inline
%load_ext autoreload
%autoreload 2
In [2]:
seed = 42
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()

Implementing Optimization Algorithms¶

In the context of deep learning, an optimization algorithm is some method of iteratively updating model parameters so that the loss converges toward some local minimum (which we hope will be good enough).

Gradient descent-based methods are by far the most popular algorithms for optimization of neural network parameters. However the high-dimensional loss-surfaces we encounter in deep learning applications are highly non-convex. They may be riddled with local minima, saddle points, large plateaus and a host of very challenging "terrain" for gradient-based optimization. This gave rise to many different methods of performing the parameter updates based on the loss gradients, aiming to tackle these optimization challenges.

The most basic gradient-based update rule can be written as,

$$ \vec{\theta} \leftarrow \vec{\theta} - \eta \nabla_{\vec{\theta}} L(\vec{\theta}; \mathcal{D}) $$

where $\mathcal{D} = \left\{ (\vec{x}^i, \vec{y}^i) \right\}_{i=1}^{M}$ is our training dataset or part of it. Specifically, if we have in total $N$ training samples, then

  • If $M=N$ this is known as regular gradient descent. If the dataset does not fit in memory the gradient of this loss becomes infeasible to compute.
  • If $M=1$, the loss is computed w.r.t. a single different sample each time. This is known as stochastic gradient descent.
  • If $1<M<N$ this is known as stochastic mini-batch gradient descent. This is the most commonly-used option.

The intuition behind gradient descent is simple: since the gradient of a multivariate function points to the direction of steepest ascent ("uphill"), we move in the opposite direction. A small step size $\eta$ known as the learning rate is required since the gradient can only serve as a first-order linear approximation of the function's behaviour at $\vec{\theta}$ (recall e.g. the Taylor expansion). However in truth our loss surface generally has nontrivial curvature caused by a high order nonlinear dependency on $\vec{\theta}$. Thus taking a large step in the direction of the gradient is actually just as likely to increase the function value.

No description has been provided for this image

The idea behind the stochastic versions is that by constantly changing the samples we compute the loss with, we get a dynamic error surface, i.e. it's different for each set of training samples. This is thought to generally improve the optimization since it may help the optimizer get out of flat regions or sharp local minima since these features may disappear in the loss surface of subsequent batches. The image below illustrates this. The different lines are different 1-dimensional losses for different training set-samples.

No description has been provided for this image

Deep learning frameworks generally provide implementations of various gradient-based optimization algorithms. Here we'll implement our own optimization module from scratch, this time keeping a similar API to the PyTorch optim package.

We define a base Optimizer class. An optimizer holds a set of parameter tensors (these are the trainable parameters of some model) and maintains internal state. It may be used as follows:

  • After the forward pass has been performed the optimizer's zero_grad() function is invoked to clear the parameter gradients computed by previous iterations.
  • After the backward pass has been performed, and gradients have been calculated for these parameters, the optimizer's step() function is invoked in order to update the value of each parameter based on it's gradient.

The exact method of update is implementation-specific for each optimizer and may depend on its internal state. In addition, adding the regularization penalty to the gradient is handled by the optimizer since it only depends on the parameter values (and not the data).

Here's the API of our Optimizer:

In [3]:
import hw2.optimizers as optimizers
help(optimizers.Optimizer)
Help on class Optimizer in module hw2.optimizers:

class Optimizer(abc.ABC)
 |  Optimizer(params)
 |  
 |  Base class for optimizers.
 |  
 |  Method resolution order:
 |      Optimizer
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, params)
 |      :param params: A sequence of model parameters to optimize. Can be a
 |      list of (param,grad) tuples as returned by the Layers, or a list of
 |      pytorch tensors in which case the grad will be taken from them.
 |  
 |  step(self)
 |      Updates all the registered parameter values based on their gradients.
 |  
 |  zero_grad(self)
 |      Sets the gradient of the optimized parameters to zero (in place).
 |  
 |  ----------------------------------------------------------------------
 |  Readonly properties defined here:
 |  
 |  params
 |      :return: A sequence of parameter tuples, each tuple containing
 |      (param_data, param_grad). The data should be updated in-place
 |      according to the grad.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset({'step'})

Vanilla SGD with Regularization¶

Let's start by implementing the simplest gradient based optimizer. The update rule will be exacly as stated above, but we'll also add a L2-regularization term to the gradient. Remember that in the loss function, the L2 regularization term is expressed by

$$R(\vec{\theta}) = \frac{1}{2}\lambda||\vec{\theta}||^2_2.$$

TODO: Complete the implementation of the VanillaSGD class in the hw2/optimizers.py module.

In [4]:
# Test VanillaSGD
torch.manual_seed(42)
p = torch.randn(500, 10)
dp = torch.randn(*p.shape)*2
params = [(p, dp)]

vsgd = optimizers.VanillaSGD(params, learn_rate=0.5, reg=0.1)
vsgd.step()

expected_p = torch.load('tests/assets/expected_vsgd.pt')
diff = torch.norm(p-expected_p).item()
print(f'diff={diff}')
test.assertLess(diff, 1e-3)
diff=1.0932822078757454e-06

Training¶

Now that we can build a model and loss function, compute their gradients and we have an optimizer, we can finally do some training!

In the spirit of more modular software design, we'll implement a class that will aid us in automating the repetitive training loop code that we usually write over and over again. This will be useful for both training our Layer-based models and also later for training PyTorch nn.Modules.

Here's our Trainer API:

In [5]:
import hw2.training as training
help(training.Trainer)
Help on class Trainer in module hw2.training:

class Trainer(abc.ABC)
 |  Trainer(model: torch.nn.modules.module.Module, device: Union[torch.device, NoneType] = None)
 |  
 |  A class abstracting the various tasks of training models.
 |  
 |  Provides methods at multiple levels of granularity:
 |  - Multiple epochs (fit)
 |  - Single epoch (train_epoch/test_epoch)
 |  - Single batch (train_batch/test_batch)
 |  
 |  Method resolution order:
 |      Trainer
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, model: torch.nn.modules.module.Module, device: Union[torch.device, NoneType] = None)
 |      Initialize the trainer.
 |      :param model: Instance of the model to train.
 |      :param device: torch.device to run training on (CPU or GPU).
 |  
 |  fit(self, dl_train: torch.utils.data.dataloader.DataLoader, dl_test: torch.utils.data.dataloader.DataLoader, num_epochs: int, checkpoints: str = None, early_stopping: int = None, print_every: int = 1, **kw) -> cs236781.train_results.FitResult
 |      Trains the model for multiple epochs with a given training set,
 |      and calculates validation loss over a given validation set.
 |      :param dl_train: Dataloader for the training set.
 |      :param dl_test: Dataloader for the test set.
 |      :param num_epochs: Number of epochs to train for.
 |      :param checkpoints: Whether to save model to file every time the
 |          test set accuracy improves. Should be a string containing a
 |          filename without extension.
 |      :param early_stopping: Whether to stop training early if there is no
 |          test loss improvement for this number of epochs.
 |      :param print_every: Print progress every this number of epochs.
 |      :return: A FitResult object containing train and test losses per epoch.
 |  
 |  save_checkpoint(self, checkpoint_filename: str)
 |      Saves the model in it's current state to a file with the given name (treated
 |      as a relative path).
 |      :param checkpoint_filename: File name or relative path to save to.
 |  
 |  test_batch(self, batch) -> cs236781.train_results.BatchResult
 |      Runs a single batch forward through the model and calculates loss.
 |      :param batch: A single batch of data  from a data loader (might
 |          be a tuple of data and labels or anything else depending on
 |          the underlying dataset.
 |      :return: A BatchResult containing the value of the loss function and
 |          the number of correctly classified samples in the batch.
 |  
 |  test_epoch(self, dl_test: torch.utils.data.dataloader.DataLoader, **kw) -> cs236781.train_results.EpochResult
 |      Evaluate model once over a test set (single epoch).
 |      :param dl_test: DataLoader for the test set.
 |      :param kw: Keyword args supported by _foreach_batch.
 |      :return: An EpochResult for the epoch.
 |  
 |  train_batch(self, batch) -> cs236781.train_results.BatchResult
 |      Runs a single batch forward through the model, calculates loss,
 |      preforms back-propagation and updates weights.
 |      :param batch: A single batch of data  from a data loader (might
 |          be a tuple of data and labels or anything else depending on
 |          the underlying dataset.
 |      :return: A BatchResult containing the value of the loss function and
 |          the number of correctly classified samples in the batch.
 |  
 |  train_epoch(self, dl_train: torch.utils.data.dataloader.DataLoader, **kw) -> cs236781.train_results.EpochResult
 |      Train once over a training set (single epoch).
 |      :param dl_train: DataLoader for the training set.
 |      :param kw: Keyword args supported by _foreach_batch.
 |      :return: An EpochResult for the epoch.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset({'test_batch', 'train_batch'})

The Trainer class splits the task of training (and evaluating) models into three conceptual levels,

  • Multiple epochs - the fit method, which returns a FitResult containing losses and accuracies for all epochs.
  • Single epoch - the train_epoch and test_epoch methods, which return an EpochResult containing losses per batch and the single accuracy result of the epoch.
  • Single batch - the train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.

It implements the first two levels. Inheriting classes are expected to implement the single-batch level methods since these are model and/or task specific.

The first thing we should do in order to verify our model, gradient calculations and optimizer implementation is to try to overfit a large model (many parameters) to a small dataset (few images). This will show us that things are working properly.

Let's begin by loading the CIFAR-10 dataset.

In [6]:
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())

print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples
Test: 10000 samples

Now, let's implement just a small part of our training logic since that's what we need right now.

TODO:

  1. Complete the implementation of the train_batch() method in the LayerTrainer class within the hw2/training.py module.
  2. Update the hyperparameter values in the part2_overfit_hp() function in the hw2/answers.py module. Tweak the hyperparameter values until your model overfits a small number of samples in the code block below. You should get 100% accuracy within a few epochs.

The following code block will use your custom Layer-based MLP implentation, custom Vanilla SGD and custom trainer to overfit the data. The classification accuracy should be 100% within a few epochs.

In [7]:
import hw2.layers as layers
import hw2.answers as answers
from torch.utils.data import DataLoader

# Overfit to a very small dataset of 20 samples
batch_size = 10
max_batches = 2
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)

# Get hyperparameters
hp = answers.part2_overfit_hp()

torch.manual_seed(seed)

# Build a model and loss using our custom MLP and CE implementations
model = layers.MLP(3*32*32, num_classes=10, hidden_features=[128]*3, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()

# Use our custom optimizer
optimizer = optimizers.VanillaSGD(model.params(), learn_rate=hp['lr'], reg=hp['reg'])

# Run training over small dataset multiple times
trainer = training.LayerTrainer(model, loss_fn, optimizer)
best_acc = 0
for i in range(20):
    res = trainer.train_epoch(dl_train, max_batches=max_batches)
    best_acc = res.accuracy if res.accuracy > best_acc else best_acc
    
test.assertGreaterEqual(best_acc, 98)

Now that we know training works, let's try to fit a model to a bit more data for a few epochs, to see how well we're doing. First, we need a function to plot the FitResults object.

In [8]:
from cs236781.plot import plot_fit
plot_fit?

TODO:

  1. Complete the implementation of the test_batch() method in the LayerTrainer class within the hw2/training.py module.
  2. Implement the fit() method of the Trainer class within the hw2/training.py module.
  3. Tweak the hyperparameters for this section in the part2_optim_hp() function in the hw2/answers.py module.
  4. Run the following code blocks to train. Try to get above 35-40% test-set accuracy.
In [9]:
# Define a larger part of the CIFAR-10 dataset (still not the whole thing)
batch_size = 50
max_batches = 100
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size//2, shuffle=False)
In [10]:
# Define a function to train a model with our Trainer and various optimizers
def train_with_optimizer(opt_name, opt_class, fig):
    torch.manual_seed(seed)
    
    # Get hyperparameters
    hp = answers.part2_optim_hp()
    hidden_features = [128] * 5
    num_epochs = 10
    
    # Create model, loss and optimizer instances
    model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'])
    loss_fn = layers.CrossEntropyLoss()
    optimizer = opt_class(model.params(), learn_rate=hp[f'lr_{opt_name}'], reg=hp['reg'])

    # Train with the Trainer
    trainer = training.LayerTrainer(model, loss_fn, optimizer)
    fit_res = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches)
    
    fig, axes = plot_fit(fit_res, fig=fig, legend=opt_name)
    return fig
In [11]:
fig_optim = None
fig_optim = train_with_optimizer('vanilla', optimizers.VanillaSGD, fig_optim)
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
No description has been provided for this image

Momentum¶

The simple vanilla SGD update is rarely used in practice since it's very slow to converge relative to other optimization algorithms.

One reason is that naïvely updating in the direction of the current gradient causes it to fluctuate wildly in areas where the loss surface in some dimensions is much steeper than in others. Another reason is that using the same learning rate for all parameters is not a great idea since not all parameters are created equal. For example, parameters associated with rare features should be updated with a larger step than ones associated with commonly-occurring features because they'll get less updates through the gradients.

Therefore more advanced optimizers take into account the previous gradients of a parameter and/or try to use a per-parameter specific learning rate instead of a common one.

Let's now implement a simple and common optimizer: SGD with Momentum. This optimizer takes previous gradients of a parameter into account when updating it's value instead of just the current one. In practice it usually provides faster convergence than the vanilla SGD.

The SGD with Momentum update rule can be stated as follows: $$\begin{align} \vec{v}_{t+1} &= \mu \vec{v}_t - \eta \delta \vec{\theta}_t \\ \vec{\theta}_{t+1} &= \vec{\theta}_t + \vec{v}_{t+1} \end{align}$$

Where $\eta$ is the learning rate, $\vec{\theta}$ is a model parameter, $\delta \vec{\theta}_t=\pderiv{L}{\vec{\theta}}(\vec{\theta}_t)$ is the gradient of the loss w.r.t. to the parameter and $0\leq\mu<1$ is a hyperparameter known as momentum.

Expanding the update rule recursively shows us now the parameter update infact depends on all previous gradient values for that parameter, where the old gradients are exponentially decayed by a factor of $\mu$ at each timestep.

Since we're incorporating previous gradient (update directions), a noisy value of the current gradient will have less effect so that the general direction of previous updates is maintained somewhat. The following figure illustrates this.

No description has been provided for this image

TODO:

  1. Complete the implementation of the MomentumSGD class in the hw2/optimizers.py module.
  2. Tweak the learning rate for momentum in part2_optim_hp() the function in the hw2/answers.py module.
  3. Run the following code block to compare to the vanilla SGD.
In [12]:
fig_optim = train_with_optimizer('momentum', optimizers.MomentumSGD, fig_optim)
fig_optim
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
Out[12]:
No description has been provided for this image

Bonus: RMSProp¶

This is another optmizer that accounts for previous gradients, but this time it uses them to adapt the learning rate per parameter.

RMSProp maintains a decaying moving average of previous squared gradients, $$ \vec{r}_{t+1} = \gamma\vec{r}_{t} + (1-\gamma)\delta\vec{\theta}_t^2 $$ where $0<\gamma<1$ is a decay constant usually set close to $1$, and $\delta\vec{\theta}_t^2$ denotes element-wise squaring.

The update rule for each parameter is then, $$ \vec{\theta}_{t+1} = \vec{\theta}_t - \left( \frac{\eta}{\sqrt{r_{t+1}+\varepsilon}} \right) \delta\vec{\theta}_t $$

where $\varepsilon$ is a small constant to prevent numerical instability. The idea here is to decrease the learning rate for parameters with high gradient values and vice-versa. The decaying moving average prevents accumulating all the past gradients which would cause the effective learning rate to become zero.

Bonus:

  1. Complete the implementation of the RMSProp class in the hw2/optimizers.py module.
  2. Tweak the learning rate for RMSProp in part2_optim_hp() the function in the hw2/answers.py module.
  3. Run the following code block to compare to the other optimizers.
In [13]:
fig_optim = train_with_optimizer('rmsprop', optimizers.RMSProp, fig_optim)
fig_optim
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
Out[13]:
No description has been provided for this image

Note that you should get better train/test accuracy with Momentum and RMSProp than Vanilla.

Dropout Regularization¶

Dropout is a useful technique to improve generalization of deep models.

The idea is simple: during the forward pass drop, i.e. set to to zero, the activation of each neuron, with a probability of $p$. For example, if $p=0.4$ this means we drop the activations of 40% of the neurons (on average).

There are a few important things to note about dropout:

  1. It is only performed during training. When testing our model the dropout layers should be a no-op.
  2. In the backward pass, gradients are only propagated back into neurons that weren't dropped during the forward pass.
  3. During testing, the activations must be scaled since the expected value of each neuron during the training phase is now $1-p$ times it's original expectation. Thus, we need to scale the test-time activations by $1-p$ to match. Equivalently, we can scale the train time activations by $1/(1-p)$.

TODO:

  1. Complete the implementation of the Dropout class in the hw2/layers.py module.
  2. Finish the implementation of the MLP's __init__() method in the hw2/layers.py module. If dropout>0 you should add a Dropout layer after each ReLU.
In [14]:
from hw2.grad_compare import compare_layer_to_torch

# Check architecture of MLP with dropout layers
mlp_dropout = layers.MLP(in_features, num_classes, [50]*3, dropout=0.6)
print(mlp_dropout)
test.assertEqual(len(mlp_dropout.sequence), 10)
for b1, b2 in zip(mlp_dropout.sequence, mlp_dropout.sequence[1:]):
    if str(b1).lower() == 'relu':
        test.assertTrue(str(b2).startswith('Dropout'))
test.assertTrue(str(mlp_dropout.sequence[-1]).startswith('Linear'))
MLP, Sequential
	[0] Linear(self.in_features=3072, self.out_features=50)
	[1] ReLU
	[2] Dropout(p=0.6)
	[3] Linear(self.in_features=50, self.out_features=50)
	[4] ReLU
	[5] Dropout(p=0.6)
	[6] Linear(self.in_features=50, self.out_features=50)
	[7] ReLU
	[8] Dropout(p=0.6)
	[9] Linear(self.in_features=50, self.out_features=10)

In [15]:
# Test end-to-end gradient in train and test modes.
print('Dropout, train mode')
mlp_dropout.train(True)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
    test.assertLess(diff, 1e-3)
    
print('Dropout, test mode')
mlp_dropout.train(False)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
    test.assertLess(diff, 1e-3)
Dropout, train mode
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000
param#03 diff=0.000
param#04 diff=0.000
param#05 diff=0.000
param#06 diff=0.000
param#07 diff=0.000
param#08 diff=0.000
Dropout, test mode
Comparing gradients... 
input    diff=0.000
param#01 diff=0.000
param#02 diff=0.000
param#03 diff=0.000
param#04 diff=0.000
param#05 diff=0.000
param#06 diff=0.000
param#07 diff=0.000
param#08 diff=0.000

To see whether dropout really improves generalization, let's take a small training set (small enough to overfit) and a large test set and check whether we get less overfitting and perhaps improved test-set accuracy when using dropout.

In [16]:
# Define a small set from CIFAR-10, but take a larger test set since we want to test generalization
batch_size = 10
max_batches = 40
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size*2, shuffle=False)

TODO: Tweak the hyperparameters for this section in the part2_dropout_hp() function in the hw2/answers.py module. Try to set them so that the first model (with dropout=0) overfits. You can disable the other dropout options until you tune the hyperparameters. We can then see the effect of dropout for generalization.

In [17]:
# Get hyperparameters
hp = answers.part2_dropout_hp()
hidden_features = [400] * 1
num_epochs = 30
In [18]:
torch.manual_seed(seed)
fig=None
#for dropout in [0]:  # Use this for tuning the hyperparms until you overfit
for dropout in [0, 0.4, 0.8]:
    model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'], dropout=dropout)
    loss_fn = layers.CrossEntropyLoss()
    optimizer = optimizers.MomentumSGD(model.params(), learn_rate=hp['lr'], reg=0)

    print('*** Training with dropout=', dropout)
    trainer = training.LayerTrainer(model, loss_fn, optimizer)
    fit_res_dropout = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches, print_every=6)
    fig, axes = plot_fit(fit_res_dropout, fig=fig, legend=f'dropout={dropout}', log_loss=True)
*** Training with dropout= 0
--- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
*** Training with dropout= 0.4
--- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
*** Training with dropout= 0.8
--- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
No description has been provided for this image

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.

In [19]:
from cs236781.answers import display_answer
import hw2.answers

Question 1¶

Regarding the graphs you got for the three dropout configurations:

  1. Explain the graphs of no-dropout vs dropout. Do they match what you expected to see?

    • If yes, explain why and provide examples based on the graphs.
    • If no, explain what you think the problem is and what should be modified to fix it.
  2. Compare the low-dropout setting to the high-dropout setting and explain based on your graphs.

In [20]:
display_answer(hw2.answers.part2_q1)

Your answer:

  1. First, we can observe that the model without dropout suffers from overfitting. While the training accuracy is almost 100% and training loss is close to zero, the test loss explodes and test accuracy is getting worse after the first few epochs. This matches the expectation that without regularization, the model merely memorizes the training data rather than generalizing.

    The graphs for no-dropout vs. dropout match the intuition that dropout acts as a regularizer. Adding dropout (dropout=0.4) successfully prevents this overfitting. While the training accuracy for dropout=0.4 is lower than the no-dropout case (as expected, since it makes learning harder), the test loss remains stable and low, and it achieves the highest test accuracy of the three configurations.

    In addition, the train_loss graph shows that adding dropout leads to slower convergence and higher final training loss compared to the no-dropout model, which is the expected trade-off for better generalization.

    Lastly, note that when the dropout is too big, as in dropout=0.8, our model is very weak and not learns a lot, resulting in underfitting.

  2. Let's compare the low-dropout setting to the high-dropout setting.

    Dropout is a part of regularization. Regularization encourages better generalization. But, adding "too much" might make it very hard for the model to learn and therefore cause underfitting. As a result, when adding regularization we would want to be in the "sweet spot" between under and over fitting.

    We can see from the graphs, that the model with dropout=0.8 result in bad losses and accuracies, both in the train and test sets. That because it adds too much regularization, as explained above. Also we can see that the loss function of that model does not decrease. That strengthens our intuition that it is very hard for the model to learn in that condition.

    In comparison, the model with dropout=0.4 have higher accuracies in the train and test sets. Since the training and test accuracies are similar across epochs, we can deduce that the model do not overfit. Also, the loss functions in the train and test sets seems to have a gradually improve over the epochs, which the regularization is not too strong.

Question 2¶

When training a model with the cross-entropy loss function, is it possible for the test loss to decrease for a few epochs while the test accuracy also decreases?

If it's possible explain how, if it's not explain why not.

In [21]:
display_answer(hw2.answers.part2_q2)

Your answer:

Yes, it is possible. We will provide an example where both the loss and the accuracy of the model decrease.

First, let's take a look at the cross entropy loss for a given batch of size N:

$ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right] $

The cross entropy loss measures how much probability the model assigns to the true label. It is sensitive to the confidence of the prediction. Accuracy, on the other hand, depends only whether the predicted class matches the true lables.

With that intuition in mind, let's look at an example: We will look at a binary classification problem.

  • Epoch A: 9 correct predictions, with high loss. 1 incorrect prediction.

    The correct predictions are correct because they passed some threshold, but still have low confidence in the true labels.

  • Epoch B: 8 correct predictions, with low loss. 2 incorrect predictions.

    The model assign high probability to the correct predictions.

The loss decrease from A to B because there are more predictions that are close to their true labels. And the accuracy decrease as well.

Therefore it is possible for both the loss and the accuraacy to decrease.

Question 3¶

  1. Compare in detail the differences and similarities of gradient descent (GD) and stochastic gradient descent (SGD).

  2. You saw the advantages of momentum in SGD. Should you incorporate momentum into gradient descent (GD) as well?

  3. You would like to try GD to train your model instead of SGD, but you're concerned that your dataset won't fit in memory. A friend suggested that you should split the data into disjoint batches, do multiple forward passes until all data is exhausted, and then do one backward pass on the sum of the losses.

    1. Would this approach produce a gradient equivalent to GD? Why or why not? provide mathematical justification for your answer.
    2. You implemented the suggested approach, and were careful to use batch sizes small enough so that each batch fits in memory. However, after some number of batches you got an out of memory error. What happened?
    3. How can you solve the above issue
In [22]:
display_answer(hw2.answers.part2_q3)

Your answer:

  1. Let's compare the differences and similarities of GD and SGD.

    Similarities:

    • They operate in a similar way(itertive updates).
    • If they converge, it is for a local minimum in non-convex functions, and for a global minimum in convex functions.
    • Both of them use the gradient of the function to find a minimum.

    Differences:

    • GD uses the entire dataset to calculate the gradient for one iteration. SGD uses a mini-batch(by definition it uses 1 sample, but in practice it is typically a mini batch...) instead.
    • Run time in SGD is faster per iteration.
    • SGD requires more iterations to converge.
    • SGD is more noisy, because it is influanced only by the small mini-batch data.
    • The randomness in SGD can help it get out of local minimum. Therefore GD is more likely to get stuck at a local minimum.
  2. Momentum in regular GD might be helpful.

    If there is a big ratio gap between the axes of the data(the loss function surface looks like a narrow valley), then GD will suffer from "zig zag" behavior. Adding momentum will help us converge quickly to the minimum, and bypass the "zig zag".

    In addition to that, adding momentum might help us get out of small local minima. That is a simillar to the behaviour we saw in SGD with momentum.

    1. In the original GD algorithm, each iteration updates its parameters in the following way:

    $$w_{(t+1)} = w_{(t)} - \eta \nabla_w L \quad , \text{ where } \quad \nabla_w L = \nabla_w \left( \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, \hat{y}_i) \right) = \frac{1}{N} \sum_{i=1}^{N} \nabla_w \ell(y_i, \hat{y}_i)$$

    Suppose we have a disjoint split of the data $\{N_i\}_{i=1}^{m} \text{ , } N = \left| \bigcup_{i=1}^{m} N_i \right|$.

    In the suggested GD algorithm, each forward pass would calculate:

    • $\sum_{y \in N_i} \ell(y, \hat{y})$ , the sum of the loss functions for the data in batch $N_i$.
    • Each layer saves the input data it received, for the backward pass.

    Because $\sum_{i=1}^{N} \ell(y_i, \hat{y}_i) = \sum_{i=1}^{m} \sum_{y \in N_i} \ell(y, \hat{y})$, from the linearity of the gradient we would get that:

    $\nabla_w \sum_{i=1}^{N} \ell(y_i, \hat{y}_i) = \nabla_w \sum_{i=1}^{m} \sum_{y \in N_i} \ell(y, \hat{y})$.

    And therefore the gradients will be the same.

  1. The problem is that in the forward pass each layer saves the intermidiate activations in the RAM. Therefore, altough we split the data we are implicitly saving it in the RAM.

  2. We can solve this issue by automatically calculating the gradient of the batch after the forward pass, and save only that. The gradients are zeroed before processing the first batch, and parameters are updated only after all batches have been processed. Therefore for each batch we would have a single vector of the gradient and that's it. From the answer to part 1 of the current question we could deduce that it would be the same as regular GD.

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 3: Binary Classification with Multilayer Perceptrons¶

In this part we'll implement a general purpose MLP and Binary Classifier using pytorch. We'll implement its training, and also learn about decision boundaries an threshold selection in the context of binary classification. Finally, we'll explore the effect of depth and width on an MLP's performance.

In [1]:
import os
import re
import sys
import glob
import unittest
from typing import Sequence, Tuple

import sklearn
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as tvtf
from torch import Tensor

%matplotlib inline
%load_ext autoreload
%autoreload 2
In [2]:
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Using device: cuda

Synthetic Dataset¶

To test our first neural network-based classifiers we'll start by creating a toy binary classification dataset, but one which is not trivial for a linear model.

In [3]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
In [4]:
def rotate_2d(X, deg=0):
    """
    Rotates each 2d sample in X of shape (N, 2) by deg degrees.
    """
    a = np.deg2rad(deg)
    return X @ np.array([[np.cos(a), -np.sin(a)],[np.sin(a), np.cos(a)]]).T

def plot_dataset_2d(X, y, n_classes=2, alpha=0.2, figsize=(8, 6), title=None, ax=None):
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    for c in range(n_classes):
        ax.scatter(*X[y==c,:].T, alpha=alpha, label=f"class {c}");
        
    ax.set_xlabel("$x_1$"); ax.set_ylabel("$x_2$");
    ax.legend(); ax.set_title((title or '') + f" (n={len(y)})")

We'll split our data into 80% train and validation, and 20% test. To make it a bit more challenging, we'll simulate a somewhat real-world setting where there are multiple populations, and the training/validation data is not sampled iid from the underlying data distribution.

In [5]:
np.random.seed(seed)

N = 10_000
N_train = int(N * .8)

# Create data from two different distributions for the training/validation
X1, y1 = make_moons(n_samples=N_train//2, noise=0.2)
X1 = rotate_2d(X1, deg=10)
X2, y2 = make_moons(n_samples=N_train//2, noise=0.25)
X2 = rotate_2d(X2, deg=50)

# Test data comes from a similar but noisier distribution
X3, y3 = make_moons(n_samples=(N-N_train), noise=0.3)
X3 = rotate_2d(X3, deg=40)

X, y = np.vstack([X1, X2, X3]), np.hstack([y1, y2, y3])
In [6]:
# Train and validation data is from mixture distribution
X_train, X_valid, y_train, y_valid = train_test_split(X[:N_train, :], y[:N_train], test_size=1/3, shuffle=False)

# Test data is only from the second distribution
X_test, y_test = X[N_train:, :], y[N_train:]

fig, ax = plt.subplots(1, 3, figsize=(20, 5))
plot_dataset_2d(X_train, y_train, title='Train', ax=ax[0]);
plot_dataset_2d(X_valid, y_valid, title='Validation', ax=ax[1]);
plot_dataset_2d(X_test, y_test, title='Test', ax=ax[2]);
No description has been provided for this image

Now let us create a data loader for each dataset.

In [7]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

batch_size = 32

dl_train, dl_valid, dl_test = [
    DataLoader(
        dataset=TensorDataset(
            torch.from_numpy(X_).to(torch.float32),
            torch.from_numpy(y_)
        ),
        shuffle=True,
        num_workers=0,
        batch_size=batch_size
    )
    for X_, y_ in [(X_train, y_train), (X_valid, y_valid), (X_test, y_test)]
]

print(f'{len(dl_train.dataset)=}, {len(dl_valid.dataset)=}, {len(dl_test.dataset)=}')
len(dl_train.dataset)=5333, len(dl_valid.dataset)=2667, len(dl_test.dataset)=2000

Simple MLP¶

A multilayer-perceptron is arguably a the most basic type of neural network model. It is composed of $L$ layers, each layer $l$ with $n_l$ perceptron ("neuron") units. Each perceptron is connected to all ouputs of the previous layer (or all inputs in the first layer), calculates their weighted sum, applies a linearity and produces a single output.

No description has been provided for this image

Each layer $l$ operates on the output of the previous layer ($\vec{y}_{l-1}$) and calculates:

$$ \vec{y}_l = \varphi\left( \mat{W}_l \vec{y}_{l-1} + \vec{b}_l \right),~ \mat{W}_l\in\set{R}^{n_{l}\times n_{l-1}},~ \vec{b}_l\in\set{R}^{n_l},~ l \in \{1,2,\dots,L\}. $$

  • Note that both input and output are vectors. We can think of the above equation as describing a layer of multiple perceptrons.
  • We'll henceforth refer to such layers as fully-connected or FC layers.
  • The first layer accepts the input of the model, i.e. $\vec{y}_0=\vec{x}\in\set{R}^d$.
  • The last layer, $L$, is the output layer, so $y_L$ is the output of the model.
  • The layers $1, 2, \dots, L-1$ are called hidden layers.

To begin, let's implement a general multi-layer perceptron model. We'll seek to implement it in a way which is both general in terms of architecture, and also composable so that we can use our MLP in the context of larger models.

TODO: Implement the MLP class in the hw2/mlp.py module.

In [8]:
from hw2.mlp import MLP

mlp = MLP(
    in_dim=2,
    dims=[8, 16, 32, 64],
    nonlins=['relu', 'tanh', nn.LeakyReLU(0.314), 'softmax']
)
mlp
Out[8]:
MLP(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=2, out_features=8, bias=True)
    (1): Linear(in_features=8, out_features=16, bias=True)
    (2): Linear(in_features=16, out_features=32, bias=True)
    (3): Linear(in_features=32, out_features=64, bias=True)
  )
  (activations): ModuleList(
    (0): ReLU()
    (1): Tanh()
    (2): LeakyReLU(negative_slope=0.314)
    (3): Softmax(dim=1)
  )
)

Let's try our implementation on a batch of data.

In [9]:
x0, y0 = next(iter(dl_train))

yhat0 = mlp(x0)

test.assertEqual(len([*mlp.parameters()]), 8)
test.assertEqual(yhat0.shape, (batch_size, mlp.out_dim))
test.assertTrue(torch.allclose(torch.sum(yhat0, dim=1), torch.tensor(1.0)))
test.assertIsNotNone(yhat0.grad_fn)

yhat0
Out[9]:
tensor([[0.0147, 0.0209, 0.0137,  ..., 0.0133, 0.0161, 0.0174],
        [0.0150, 0.0186, 0.0134,  ..., 0.0131, 0.0178, 0.0176],
        [0.0148, 0.0196, 0.0134,  ..., 0.0132, 0.0171, 0.0176],
        ...,
        [0.0146, 0.0201, 0.0135,  ..., 0.0132, 0.0168, 0.0176],
        [0.0144, 0.0217, 0.0136,  ..., 0.0129, 0.0147, 0.0174],
        [0.0151, 0.0192, 0.0138,  ..., 0.0133, 0.0174, 0.0176]],
       grad_fn=<SoftmaxBackward0>)

MLP for Binary Classification¶

The MLP model we've implemented, while useful, is very general. For the task of binary classification, we would like to add some additional functionality to it: the ability to output a normalized score for a sample being in class one (which we interpret as a probability) and a prediction based on some threshold of this probability. In addition, we need some way to calculate a meaningful threshold based on the data and a trained model at hand.

In order to maintain generality, we'll add this functionlity in the form of a wrapper: A BinaryClassifier class that can wrap any model producing two output features, and provide the the functionality stated above.

TODO: In the hw2/classifier.py module, implement the BinaryClassifier and the missing parts of its base class, Classifier. Read the method documentation carefully and implement accordingly. You can ignore the roc_threshold method at this stage.

In [10]:
from hw2.classifier import BinaryClassifier

bmlp4 = BinaryClassifier(
    model=MLP(in_dim=2, dims=[*[10]*3, 2], nonlins=[*['relu']*3, 'none']),
    threshold=0.5
)
print(bmlp4)

# Test model
test.assertEqual(len([*bmlp4.parameters()]), 8)
test.assertIsNotNone(bmlp4(x0).grad_fn)

# Test forward
yhat0_scores = bmlp4(x0)
test.assertEqual(yhat0_scores.shape, (batch_size, 2))
test.assertFalse(torch.allclose(torch.sum(yhat0_scores, dim=1), torch.tensor(1.0)))

# Test predict_proba
yhat0_proba = bmlp4.predict_proba(x0)
test.assertEqual(yhat0_proba.shape, (batch_size, 2))
test.assertTrue(torch.allclose(torch.sum(yhat0_proba, dim=1), torch.tensor(1.0)))

# Test classify
yhat0 = bmlp4.classify(x0)
test.assertEqual(yhat0.shape, (batch_size,))
test.assertEqual(yhat0.dtype, torch.int)
test.assertTrue(all(yh_ in (0, 1) for yh_ in yhat0))
BinaryClassifier(
  (model): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=2, out_features=10, bias=True)
      (1): Linear(in_features=10, out_features=10, bias=True)
      (2): Linear(in_features=10, out_features=10, bias=True)
      (3): Linear(in_features=10, out_features=2, bias=True)
    )
    (activations): ModuleList(
      (0): ReLU()
      (1): ReLU()
      (2): ReLU()
      (3): Identity()
    )
  )
)

Training¶

Now that we have a classifier, we need to train it. We will abstract the various aspects of training such as mlutiple epochs, iterating over batches, early stopping and saving model checkpoints, into a Trainer that will take care of these concerns.

The Trainer class splits the task of training (and evaluating) models into three conceptual levels,

  • Multiple epochs - the fit method, which returns a FitResult containing losses and accuracies for all epochs.
  • Single epoch - the train_epoch and test_epoch methods, which return an EpochResult containing losses per batch and the single accuracy result of the epoch.
  • Single batch - the train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.

It implements the first two levels. Inheriting classes are expected to implement the single-batch level methods since these are model and/or task specific.

TODO:

  1. Implement the Trainer's fit method and the ClassifierTrainer's train_batch/test_batch methods, in the hw2/training.py module. You may ignore the Optional parts about early stopping an model checkpoints at this stage.

  2. Set the model's architecture hyper-parameters and the optimizer hyperparameters in part3_arch_hp() and part3_optim_hp(), respectively, in hw2/answers.py.

Since this is a toy dataset, you should be able to quickly get above 85% accuracy even on the test set.

In [11]:
from hw2.training import ClassifierTrainer
from hw2.answers import part3_arch_hp, part3_optim_hp

torch.manual_seed(seed)

hp_arch = part3_arch_hp()
hp_optim = part3_optim_hp()

model = BinaryClassifier(
    model=MLP(
        in_dim=2,
        dims=[*[hp_arch['hidden_dims'],]*hp_arch['n_layers'], 2],
        nonlins=[*[hp_arch['activation'],]*hp_arch['n_layers'], hp_arch['out_activation']]
    ),
    threshold=0.5,
)
print(model)

loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)
trainer = ClassifierTrainer(model, loss_fn, optimizer)

fit_result = trainer.fit(dl_train, dl_valid, num_epochs=20, print_every=10);

test.assertGreaterEqual(fit_result.train_acc[-1], 85.0)
test.assertGreaterEqual(fit_result.test_acc[-1], 75.0)
BinaryClassifier(
  (model): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=2, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Linear(in_features=64, out_features=2, bias=True)
    )
    (activations): ModuleList(
      (0): ReLU()
      (1): ReLU()
      (2): ReLU()
      (3): Identity()
    )
  )
)
--- EPOCH 1/20 ---
train_batch:   0%|          | 0/167 [00:00<?, ?it/s]
test_batch:   0%|          | 0/84 [00:00<?, ?it/s]
--- EPOCH 11/20 ---
train_batch:   0%|          | 0/167 [00:00<?, ?it/s]
test_batch:   0%|          | 0/84 [00:00<?, ?it/s]
--- EPOCH 20/20 ---
train_batch:   0%|          | 0/167 [00:00<?, ?it/s]
test_batch:   0%|          | 0/84 [00:00<?, ?it/s]
In [12]:
from cs236781.plot import plot_fit

plot_fit(fit_result, log_loss=False, train_test_overlay=True);
No description has been provided for this image

Decision Boundary¶

An important part of understanding what a non-linear classifier like our MLP is doing is visualizing it's decision boundaries. When we only have two input features, these are relatively simple to visualize, since we can simply plot our data on the plane, and evaluate our classifier on a constant 2D grid in order to approximate the decision boundary.

TODO: Implement the plot_decision_boundary_2d function in the hw2/classifier.py module.

In [13]:
from hw2.classifier import plot_decision_boundary_2d

fig, ax = plot_decision_boundary_2d(model, *dl_valid.dataset.tensors)
No description has been provided for this image

Threshold Selection¶

Another important component, especially in the context of binary classification is threshold selection. Until now, we arbitrarily chose a threshold of 0.5 when deciding the class label based on the probability score we calculated via softmax. In other words, we classified a sample to class 1 (the 'positive' class) when it's probability score was greater or equal to 0.5.

However, in real-world classifiction problems we'll need to choose our threshold wisely based on the domain-specific requirements of the problem. For example, depending on our application, we might care more about high sensitivity (correctly classifying positive examples), while for other applications specificity (correctly classifying negative examples) is more important.

One way to understand the mistakes a model is making is to look at its Confusion Matrix. From it, we easily see e.g. the false-negative rate (FNR) and false-positive rate (FPR).

Let's look at the confusion matrices on the test and validation data using the model we trained above.

In [14]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion(classifier, x: np.ndarray, y: np.ndarray, ax=None):
    y_hat = classifier.classify(torch.from_numpy(x).to(torch.float32)).numpy()
    conf_mat = confusion_matrix(y, y_hat, normalize='all')
    ConfusionMatrixDisplay(conf_mat).plot(ax=ax, colorbar=False)
    
model.threshold = 0.5

_, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].set_title("Train"); axes[1].set_title("Validation");
plot_confusion(model, X_train, y_train, ax=axes[0])
plot_confusion(model, X_valid, y_valid, ax=axes[1])
No description has been provided for this image

We can see that the model makes a different number of false-posiive and false-negative errors. Clearly, this proportion would change if the classification threshold was different.

A very common way to select the classification threshold is to find a threshold which optimally balances between the FPR and FNR. This can be done by plotting the model's ROC curve, which shows 1-FNR vs. FPR for multiple threshold values, and selecting the point closest to the ideal point ((0, 1)).

TODO: Implement the select_roc_thresh function in the hw2.classifier module.

In [15]:
from hw2.classifier import select_roc_thresh


optimal_thresh = select_roc_thresh(model, *dl_valid.dataset.tensors, plot=True)
No description has been provided for this image

Let's see the effect of our threshold selection on the confusion matrix and decision boundary.

In [16]:
model.threshold = optimal_thresh

_, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].set_title("Train"); axes[1].set_title("Validation");
plot_confusion(model, X_train, y_train, ax=axes[0])
plot_confusion(model, X_valid, y_valid, ax=axes[1])
fig, ax = plot_decision_boundary_2d(model, *dl_valid.dataset.tensors)
No description has been provided for this image
No description has been provided for this image

Architecture Experiments¶

Now, equipped with the tools we've implemented so far we'll expertiment with various MLP architectures. We'll seek to study the effect of the models depth (number of hidden layers) and width (number of neurons per hidden layer) on the its decision boundaries and the resulting performance. After training, we will use the validation set for threshold selection, and seek to maximize the performance on the test set.

TODO: Implement the mlp_experiment function in hw2/experiments.py. You are free to configure any model and optimization hyperparameters however you like, except for the specified width and depth. Experiment with various options for these other hyperparameters and try to obtain the best results you can.

In [17]:
from itertools import product
from tqdm.auto import tqdm
from hw2.experiments import mlp_experiment

torch.manual_seed(seed)

depths = [1, 2, 4]
widths = [2, 8, 32]
exp_configs = product(enumerate(widths), enumerate(depths))
fig, axes = plt.subplots(len(widths), len(depths), figsize=(10*len(depths), 10*len(widths)), squeeze=False)
test_accs = []

for (i, width), (j, depth) in tqdm(list(exp_configs)):
    model, thresh, valid_acc, test_acc = mlp_experiment(
        depth, width, dl_train, dl_valid, dl_test, n_epochs=10
    )
    test_accs.append(test_acc)
    fig, ax = plot_decision_boundary_2d(model, *dl_test.dataset.tensors, ax=axes[i, j])
    ax.set_title(f"{depth=}, {width=}")
    ax.text(ax.get_xlim()[0]*.95, ax.get_ylim()[1]*.95, f"{thresh=:.2f}\n{valid_acc=:.1f}%\n{test_acc=:.1f}%", va="top")
    
# Assert minimal performance requirements.
# You should be able to do better than these by at least 5%.
test.assertGreaterEqual(np.min(test_accs), 75.0)
test.assertGreaterEqual(np.quantile(test_accs, 0.75), 85.0)
  0%|          | 0/9 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
test_batch:   0%|          | 0/63 [00:00<?, ?it/s]
No description has been provided for this image

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.

In [18]:
from cs236781.answers import display_answer
import hw2.answers

Question 1¶

Consider the first binary classifier you trained in this notebook and the loss/accuracy curves we plotted for it on the train and validation sets, as well as the decision boundary plot.

  1. Explain what each type of error is.

    1. Optimization error.
    2. Generalization error.
    3. Approximation error.
  2. Based on those plots, explain qualitatively whether or not your model has high error in each of the above types.

note: Since this is a qualitative question, assume "high" simply means "I would take measures in order to decrease it further".

In [19]:
display_answer(hw2.answers.part3_q1)

Your answer:

  1. Let's explain what each type of error means.

    1. Optimization error:

      This error is caused by a bad optimizer(might be because bad hyper parameters). It is the gap between the best possible model in the hypothesis class and the model found by training.

    2. Generalization error:

      This error is caused by the model overfitting the training data. It is defined as the gap between the models performance on the train and test sets. We can look at it as a measure of how accuratly the model is able to predict outcomes for data it has never seen before.

    3. Approximation error:

      This error caused by the limitations of the model to fit the data. It is the gap between the best model in the hypothesis class and the true target function. Good accuracy often leades to small approximation loss. If our model is underfitting, then it has high approximation error.

  2. Let's take a look at each of the errors from the previous section of this question.

    1. Our model does not have high optimization error.

      The given "moon" data the model is trained on is noisy by itself. Therefore the ~95% accuracy we get in the train set can assure us that the optimizer is fine.

    2. Our model have a little generalization error, but nothng major.

      We can see from the plots that around epoch 8 begins a gap between the test and train in both plots. This tells us that the model is slightly overfitting. But, because we are still getting good performances in the test sets, the generalization error is not high.

    3. Our model does not have high approximation error.

      The model gets good results on the data, and is able to fit it quite good. Therefore we can deduce that the hypothesis class is expressive enough to model the underlying structure of the data.

Question 2¶

Recall the two matrics we track from the confusion matrix, FPR and FNR. Provide an example scenario where we would prefer to optimize for FPR at the cost of increasing FNR and vice versa.

Explain your answers.

In [20]:
display_answer(hw2.answers.part3_q2)

Your answer:

Example 1 - prefer to optimize FPR.

We would want to minimize False Positive Rate, if false positives have bad consequences.

For example, suppose our model predicts whether or not a patient is free to leave the hospital. A false positve result would mean that we let an ill person leave the hospital. He is then not treated as he should be, and might spread his disease! On the other hand, we can compromise on high False Negative Rate, which means that we keep a patient that can already leave the hospital. This will not have fatal consequences and therfore we care less about that.

Example 2 - prefer optimize FNR.

We would want to minimize False Negative Rate, if false negatives have bad consequences.

For example, suppose our model predicts whether or not a patient need to take some medicine. A false negative would mean that a patient will not take his medicine. A false positive would mean that a patient will take a medicine he do not need to take. We assume that the medicine cannot cause health issues, but not taking it is unhealthy. Therefore we would prefer to optimize the FNR at the cost of increasing FPR.

Question 3¶

Analyze your results from the Architecture Experiment.

  1. Explain the decision boundaries and model performance you obtained for the columns (fixed depth, width varies).
  2. Explain the decision boundaries and model performance you obtained for the rows (fixed width, depth varies).
  3. Compare and explain the results for the following pair of configurations, which have the same number of total parameters:
    • depth=1, width=32 and depth=4, width=8
  4. Explain the effect of threshold selection on the validation set: did it improve the results on the test set? why?
In [21]:
display_answer(hw2.answers.part3_q3)

Your answer:

  1. Each column represents a fixed depth.

    As we can see from the plots, in every column there is a gradual increase in model complexity as the model gain more width. As a result, the decision boundaries evolve from simply linear to more complex areas. This corresponeds to the Universal Approximation Theorem, which states that there exists a network with a single finite layer that can fit every continuous function "arbitrarily well". We see this clearly in the first column(depth=1). The performance also increase as the width is getting larger, until a certain point where its starts to decrease as a result of overgitting or optimization problems.

  2. Each row represents a fixed width.

    As we can see from the plots, there is not a lot of change in decition boundaries between the different depth in each row. In particular, we can notice that the best test accuracy is achieved by:

    • Row 1 (Width=2): Column 3 (Depth=4)
    • Row 2 (Width=8): Column 2 (Depth=2)
    • Row 3 (Width=32): Column 1 (Depth=1)

    This suggest that as the width increase, adding depth does not necessarily improve performance, and might actually hurt it. For the 2D "moon" data, a single wide layer is sufficient, and gets the best test accuracy. Also, we might get that the optimizer is having problems optimizing the deeper networks(duo to vanishing gradients for example).

  3. We got very similar performances between the two models.

    Adding depth to the model, adds sequential non-linear activation functions. As a result, altough both of the models have the same number of parameters, the deeper one might represent more complex datasets(note that the hypothesis classes are different). Because the data is not very complex, a single wide layer is sufficient. Therefore both models get very similar accuracies, and the shallow one has slightly better generalization.

  4. Threshold selection improved the results on the test set.

    As we saw earlier in this part, it help us get a better balance between FPR and FNR, and better performances in general. That is duo to better interpretation of the outputs. In the model experiment, each model chose a difference threshold, which improved test set resaults. Some models chose a threshold as low as 0.12, which is quite far then the standard thresh=0.5.

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 4: Convolutional Neural Networks¶

In this part we will explore convolution networks. We'll implement a common block-based deep CNN pattern with an without residual connections.

In [1]:
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf

%matplotlib inline
%load_ext autoreload
%autoreload 2
In [2]:
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()

Reminder: Convolutional layers and networks¶

Convolutional layers are the most essential building blocks of the state of the art deep learning image classification models and also play an important role in many other tasks. As we saw in the tutorial, when applied to images, convolutional layers operate on and produce volumes (3D tensors) of activations.

A convenient way to interpret convolutional layers for images is as a collection of 3D learnable filters, each of which operates on a small spatial region of the input volume. Each filter is convolved with the input volume ("slides over it"), and a dot product is computed at each location followed by a non-linearity which produces one activation. All these activations produce a 2D plane known as a feature map. Multiple feature maps (one for each filter) comprise the output volume.

No description has been provided for this image

A crucial property of convolutional layers is their translation equivariance, i.e. shifting the input results in and equivalently shifted output. This produces the ability to detect features regardless of their spatial location in the input.

Convolutional network architectures usually follow a pattern basic repeating blocks: one or more convolution layers, each followed by a non-linearity (generally ReLU) and then a pooling layer to reduce spatial dimensions. Usually, the number of convolutional filters increases the deeper they are in the network. These layers are meant to extract features from the input. Then, one or more fully-connected layers is used to combine the extracted features into the required number of output class scores.

Building convolutional networks with PyTorch¶

PyTorch provides all the basic building blocks needed for creating a convolutional arcitecture within the torch.nn package. Let's use them to create a basic convolutional network with the following architecture pattern:

[(CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC

Here $N$ is the total number of convolutional layers, $P$ specifies how many convolutions to perform before each pooling layer and $M$ specifies the number of hidden fully-connected layers before the final output layer.

TODO: Complete the implementaion of the CNN class in the hw2/cnn.py module. Use PyTorch's nn.Conv2d and nn.MaxPool2d for the convolution and pooling layers. It's recommended to implement the missing functionality in the order of the class' methods.

In [3]:
from hw2.cnn import CNN

test_params = [
    dict(
        in_size=(3,100,100), out_classes=10,
        channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
        conv_params=dict(kernel_size=3, stride=1, padding=1),
        activation_type='relu', activation_params=dict(),
        pooling_type='max', pooling_params=dict(kernel_size=2),
    ),
    dict(
        in_size=(3,100,100), out_classes=10,
        channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
        conv_params=dict(kernel_size=5, stride=2, padding=3),
        activation_type='lrelu', activation_params=dict(negative_slope=0.05),
        pooling_type='avg', pooling_params=dict(kernel_size=3),
    ),
    dict(
        in_size=(3,100,100), out_classes=3,
        channels=[16]*5, pool_every=3, hidden_dims=[100]*1,
        conv_params=dict(kernel_size=2, stride=2, padding=2),
        activation_type='lrelu', activation_params=dict(negative_slope=0.1),
        pooling_type='max', pooling_params=dict(kernel_size=2),
    ),
]

for i, params in enumerate(test_params):
    torch.manual_seed(seed)
    net = CNN(**params)
    print(f"\n=== test {i=} ===")
    print(net)

    torch.manual_seed(seed)
    test_out = net(torch.ones(1, 3, 100, 100))
    print(f'{test_out=}')

    expected_out = torch.load(f'tests/assets/expected_conv_out_{i:02d}.pt')
    print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
    test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
CNN(
  (feature_extractor): Sequential(
    (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU()
    (7): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU()
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (mlp): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=20000, out_features=100, bias=True)
      (1): Linear(in_features=100, out_features=100, bias=True)
      (2): Linear(in_features=100, out_features=10, bias=True)
    )
    (activations): ModuleList(
      (0): ReLU()
      (1): ReLU()
      (2): Identity()
    )
  )
)
test_out=tensor([[ 0.0745, -0.1058,  0.0928,  0.0476,  0.0057,  0.0051,  0.0938, -0.0582,
          0.0573,  0.0583]], grad_fn=<AddmmBackward0>)
max_diff=0.0

=== test i=1 ===
CNN(
  (feature_extractor): Sequential(
    (0): Conv2d(3, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
    (1): LeakyReLU(negative_slope=0.05)
    (2): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
    (3): LeakyReLU(negative_slope=0.05)
    (4): AvgPool2d(kernel_size=3, stride=3, padding=0)
    (5): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
    (6): LeakyReLU(negative_slope=0.05)
    (7): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
    (8): LeakyReLU(negative_slope=0.05)
    (9): AvgPool2d(kernel_size=3, stride=3, padding=0)
  )
  (mlp): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=32, out_features=100, bias=True)
      (1): Linear(in_features=100, out_features=100, bias=True)
      (2): Linear(in_features=100, out_features=10, bias=True)
    )
    (activations): ModuleList(
      (0): LeakyReLU(negative_slope=0.05)
      (1): LeakyReLU(negative_slope=0.05)
      (2): Identity()
    )
  )
)
test_out=tensor([[ 0.0724, -0.0030,  0.0637, -0.0073,  0.0932, -0.0662, -0.0656,  0.0076,
          0.0193,  0.0241]], grad_fn=<AddmmBackward0>)
max_diff=0.0

=== test i=2 ===
CNN(
  (feature_extractor): Sequential(
    (0): Conv2d(3, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
    (1): LeakyReLU(negative_slope=0.1)
    (2): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
    (3): LeakyReLU(negative_slope=0.1)
    (4): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
    (5): LeakyReLU(negative_slope=0.1)
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
    (8): LeakyReLU(negative_slope=0.1)
    (9): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
    (10): LeakyReLU(negative_slope=0.1)
  )
  (mlp): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=400, out_features=100, bias=True)
      (1): Linear(in_features=100, out_features=3, bias=True)
    )
    (activations): ModuleList(
      (0): LeakyReLU(negative_slope=0.1)
      (1): Identity()
    )
  )
)
test_out=tensor([[-0.0004, -0.0094,  0.0817]], grad_fn=<AddmmBackward0>)
max_diff=0.0

As before, we'll wrap our model with a Classifier that provides the necessary functionality for calculating probability scores and obtaining class label predictions. This time, we'll use a simple approach that simply selects the class with the highest score.

TODO: Implement the ArgMaxClassifier in the hw2/classifier.py module.

In [4]:
from hw2.classifier import ArgMaxClassifier

model = ArgMaxClassifier(model=CNN(**test_params[0]))

test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test.assertEqual(model.classify(test_image).shape, (1,))
test.assertEqual(model.predict_proba(test_image).shape, (1, 10))
test.assertAlmostEqual(torch.sum(model.predict_proba(test_image)).item(), 1.0, delta=1e-3)

Let's now load CIFAR-10 to use as our dataset.

In [5]:
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())

print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')

x0,_ = ds_train[0]
in_size = x0.shape
num_classes = 10
print('input image size =', in_size)
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples
Test: 10000 samples
input image size = torch.Size([3, 32, 32])

Now as usual, as a sanity test let's make sure we can overfit a tiny dataset with our model. But first we need to adapt our Trainer for PyTorch models.

TODO:

  1. Complete the implementaion of the ClassifierTrainer class in the hw2/training.py module if you haven't done so already.
  2. Set the optimizer hyperparameters in part4_optim_hp(), respectively, in hw2/answers.py.
In [6]:
from hw2.training import ClassifierTrainer
from hw2.answers import part4_optim_hp

torch.manual_seed(seed)

# Define a tiny part of the CIFAR-10 dataset to overfit it
batch_size = 2
max_batches = 25
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)

# Create model, loss and optimizer instances
model = ArgMaxClassifier(
    model=CNN(
        in_size, num_classes, channels=[32], pool_every=1, hidden_dims=[100],
        conv_params=dict(kernel_size=3, stride=1, padding=1),
        pooling_params=dict(kernel_size=2),
    )
)

hp_optim = part4_optim_hp()
loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)

# Use ClassifierTrainer to run only the training loop a few times.
trainer = ClassifierTrainer(model, loss_fn, optimizer, device)
best_acc = 0
for i in range(25):
    res = trainer.train_epoch(dl_train, max_batches=max_batches, verbose=(i%5==0))
    best_acc = res.accuracy if res.accuracy > best_acc else best_acc
    
# Test overfitting
test.assertGreaterEqual(best_acc, 90)
train_batch:   0%|          | 0/25 [00:00<?, ?it/s]
train_batch:   0%|          | 0/25 [00:00<?, ?it/s]
train_batch:   0%|          | 0/25 [00:00<?, ?it/s]
train_batch:   0%|          | 0/25 [00:00<?, ?it/s]
train_batch:   0%|          | 0/25 [00:00<?, ?it/s]

Residual Networks¶

A very common addition to the basic convolutional architecture described above are shortcut connections. First proposed by He et al. (2016), this simple addition has been shown to be crucial ingredient in order to achieve effective learning with very deep networks. Virtually all state of the art image classification models from recent years use this technique.

The idea is to add an shortcut, or skip, around every two or more convolutional layers:

No description has been provided for this image

This adds an easy way for the network to learn identity mappings: set the weight values to be very small. The consequence is that the convolutional layers to learn a residual mapping, i.e. some delta that is applied to the identity map, instead of actually learning a completely new mapping from scratch.

Lets start by implementing a general residual block, representing a structure similar to the above diagrams. Our residual block will be composed of:

  • A "main path" with some number of convolutional layers with ReLU between them. Optionally, we'll also apply dropout and batch normalization layers (in this order) between the convolutions, before the ReLU.
  • A "shortcut path" implementing an identity mapping around the main path. In case of a different number of input/output channels, the shortcut path should contain an additional 1x1 convolution to project the channel dimension.
  • The sum of the main and shortcut paths output is passed though a ReLU and returned.

TODO: Complete the implementation of the ResidualBlock's __init__() method in the hw2/cnn.py module.

In [7]:
from hw2.cnn import ResidualBlock

torch.manual_seed(seed)

resblock = ResidualBlock(
    in_channels=3, channels=[6, 4]*2, kernel_sizes=[3, 5]*2,
    batchnorm=True, dropout=0.2
)
print(resblock)

torch.manual_seed(seed)
test_out = resblock(torch.ones(1, 3, 32, 32))
print(f'{test_out.shape=}')

expected_out = torch.load('tests/assets/expected_resblock_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBlock(
  (main_path): Sequential(
    (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): Dropout2d(p=0.2, inplace=False)
    (2): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): ReLU()
    (4): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (5): Dropout2d(p=0.2, inplace=False)
    (6): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): ReLU()
    (8): Conv2d(4, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): Dropout2d(p=0.2, inplace=False)
    (10): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU()
    (12): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  )
  (shortcut_path): Sequential(
    (0): Conv2d(3, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)
  )
)
test_out.shape=torch.Size([1, 4, 32, 32])
max_diff=5.960464477539062e-07

Bottleneck Blocks¶

In the ResNet Block diagram shown above, the right block is called a bottleneck block. This type of block is mainly used deep in the network, where the feature space becomes increasingly high-dimensional (i.e. there are many channels).

Instead of applying a KxK conv layer on the original input channels, a bottleneck block first projects to a lower number of features (channels), applies the KxK conv on the result, and then projects back to the original feature space. Both projections are performed with 1x1 convolutions.

TODO: Complete the implementation of the ResidualBottleneckBlock in the hw2/cnn.py module.

In [8]:
from hw2.cnn import ResidualBottleneckBlock

torch.manual_seed(seed)
resblock_bn = ResidualBottleneckBlock(
    in_out_channels=256, inner_channels=[64, 32, 64], inner_kernel_sizes=[3, 5, 3],
    batchnorm=False, dropout=0.1, activation_type="lrelu"
)
print(resblock_bn)

# Test a forward pass
torch.manual_seed(seed)
test_in  = torch.ones(1, 256, 32, 32)
test_out = resblock_bn(test_in)
print(f'{test_out.shape=}')
assert test_out.shape == test_in.shape 

expected_out = torch.load('tests/assets/expected_resblock_bn_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBottleneckBlock(
  (main_path): Sequential(
    (0): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
    (1): Dropout2d(p=0.1, inplace=False)
    (2): LeakyReLU(negative_slope=0.01)
    (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): Dropout2d(p=0.1, inplace=False)
    (5): LeakyReLU(negative_slope=0.01)
    (6): Conv2d(64, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (7): Dropout2d(p=0.1, inplace=False)
    (8): LeakyReLU(negative_slope=0.01)
    (9): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (10): Dropout2d(p=0.1, inplace=False)
    (11): LeakyReLU(negative_slope=0.01)
    (12): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1))
  )
  (shortcut_path): Sequential(
    (0): Identity()
  )
)
test_out.shape=torch.Size([1, 256, 32, 32])
max_diff=1.1920928955078125e-07

Now, based on the ResidualBlock, we'll implement our own variation of a residual network (ResNet), with the following architecture:

[-> (CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
 \------- SKIP ------/
 

Note that $N$, $P$ and $M$ are as before, however now $P$ also controls the number of convolutional layers to add a skip-connection to.

TODO: Complete the implementation of the ResNet class in the hw2/cnn.py module. You must use your ResidualBlocks or ResidualBottleneckBlocks to group together every $P$ convolutional layers.

In [9]:
from hw2.cnn import ResNet

test_params = [
    dict(
        in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
        pool_every=4, hidden_dims=[100]*2,
        activation_type='lrelu', activation_params=dict(negative_slope=0.01),
        pooling_type='avg', pooling_params=dict(kernel_size=2),
        batchnorm=True, dropout=0.1,
        bottleneck=False
    ),
    dict(
        # create 64->16->64 bottlenecks
        in_size=(3,100,100), out_classes=5, channels=[64, 16, 64]*4,
        pool_every=3, hidden_dims=[64]*1,
        activation_type='tanh',
        pooling_type='max', pooling_params=dict(kernel_size=2),
        batchnorm=True, dropout=0.1,
        bottleneck=True
    )
]

for i, params in enumerate(test_params):
    torch.manual_seed(seed)
    net = ResNet(**params)
    print(f"\n=== test {i=} ===")
    print(net)

    torch.manual_seed(seed)
    test_out = net(torch.ones(1, 3, 100, 100))
    print(f'{test_out=}')
    
    expected_out = torch.load(f'tests/assets/expected_resnet_out_{i:02d}.pt')
    print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
    test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
ResNet(
  (feature_extractor): Sequential(
    (0): ResidualBlock(
      (main_path): Sequential(
        (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Dropout2d(p=0.1, inplace=False)
        (2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): LeakyReLU(negative_slope=0.01)
        (4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): Dropout2d(p=0.1, inplace=False)
        (6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): LeakyReLU(negative_slope=0.01)
        (8): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (9): Dropout2d(p=0.1, inplace=False)
        (10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (11): LeakyReLU(negative_slope=0.01)
        (12): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
    )
    (1): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (2): ResidualBlock(
      (main_path): Sequential(
        (0): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Dropout2d(p=0.1, inplace=False)
        (2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): LeakyReLU(negative_slope=0.01)
        (4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Identity()
      )
    )
  )
  (mlp): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=160000, out_features=100, bias=True)
      (1): Linear(in_features=100, out_features=100, bias=True)
      (2): Linear(in_features=100, out_features=10, bias=True)
    )
    (activations): ModuleList(
      (0): LeakyReLU(negative_slope=0.01)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Identity()
    )
  )
)
test_out=tensor([[ 0.0422,  0.0332,  0.1870, -0.0532, -0.0742,  0.1143, -0.0617, -0.0467,
          0.0852,  0.0221]], grad_fn=<AddmmBackward0>)
max_diff=1.1920928955078125e-07

=== test i=1 ===
ResNet(
  (feature_extractor): Sequential(
    (0): ResidualBlock(
      (main_path): Sequential(
        (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Dropout2d(p=0.1, inplace=False)
        (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Tanh()
        (4): Conv2d(64, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): Dropout2d(p=0.1, inplace=False)
        (6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): Tanh()
        (8): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
    )
    (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (2): ResidualBottleneckBlock(
      (main_path): Sequential(
        (0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
        (1): Dropout2d(p=0.1, inplace=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Tanh()
        (4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): Dropout2d(p=0.1, inplace=False)
        (6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): Tanh()
        (8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Identity()
      )
    )
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): ResidualBottleneckBlock(
      (main_path): Sequential(
        (0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
        (1): Dropout2d(p=0.1, inplace=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Tanh()
        (4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): Dropout2d(p=0.1, inplace=False)
        (6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): Tanh()
        (8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Identity()
      )
    )
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): ResidualBottleneckBlock(
      (main_path): Sequential(
        (0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
        (1): Dropout2d(p=0.1, inplace=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): Tanh()
        (4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): Dropout2d(p=0.1, inplace=False)
        (6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): Tanh()
        (8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Identity()
      )
    )
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (mlp): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=2304, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=5, bias=True)
    )
    (activations): ModuleList(
      (0): Tanh()
      (1): Identity()
    )
  )
)
test_out=tensor([[ 0.0237, -0.1945, -0.0085, -0.4024, -0.2667]],
       grad_fn=<AddmmBackward0>)
max_diff=2.3096799850463867e-07

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.

In [10]:
from cs236781.answers import display_answer
import hw2.answers

Question 1¶

Consider the bottleneck block from the right side of the ResNet diagram above. Compare it to a regular block that performs a two 3x3 convs directly on the 256-channel input (i.e. as shown in the left side of the diagram, with a different number of channels). Explain the differences between the regular block and the bottleneck block in terms of:

  1. Number of parameters. Calculate the exact numbers for these two examples. Show your work.
  2. Number of floating point operations required to compute an output (qualitative assessment).
  3. Ability to combine the input: (1) spatially (within feature maps); (2) across feature maps.
In [11]:
display_answer(hw2.answers.part4_q1)

1. Number of parameters:

The number of parameters in a convolutional layer is given by: $$C_{\text{in}} \times C_{\text{out}} \times K_h \times K_w$$ where $C_{\text{in}}$ is the number of input channels, $C_{\text{out}}$ is the number of output channels, and $K_h, K_w$ are the kernel height and width.

Assume the regular block consists of two 3×3 convolutional layers, each with 256 input and 256 output channels.

Each 3×3 convolution has $256 \times 256 \times 3 \times 3 = 589,824$ parameters.

Total for regular block: $2 \times 589,824 = 1,179,648$ parameters.

The bottleneck block consists of:

  • 1×1 convolution: 256 input → 64 output channels: $256 \times 64 \times 1 \times 1 = 16,384$ parameters.
  • 3×3 convolution: 64 input → 64 output channels: $64 \times 64 \times 3 \times 3 = 36,864$ parameters.
  • 1×1 convolution: 64 input → 256 output channels: $64 \times 256 \times 1 \times 1 = 16,384$ parameters.

Total for bottleneck block: $16,384 + 36,864 + 16,384 = 69,632$ parameters.

The bottleneck block has significantly fewer parameters due to the reduced channel dimension in the middle layer.

2. Number of floating point operations:

Convolutional layers' FLOPs are roughly proportional to $\text{number of conv layer parameters} \times H_{\text{out}} \times W_{\text{out}}$ (assuming the same output spatial dimensions for both blocks).

The regular block has two 3×3 convolutions, each with $256 \times 256 \times 9$ operations per spatial position, totaling approximately $2 \times 256 \times 256 \times 9 \times H_{\text{out}} \times W_{\text{out}} = 1,179,648 \times H_{\text{out}} \times W_{\text{out}}$ FLOPs.

The bottleneck block has:

  • 1×1 convolution: $256 \times 64 \times 1$ operations per spatial position.
  • 3×3 convolution: $64 \times 64 \times 9$ operations per spatial position.
  • 1×1 convolution: $64 \times 256 \times 1$ operations per spatial position.

Totaling approximately $(256 \times 64 + 64 \times 64 \times 9 + 64 \times 256) \times H_{\text{out}} \times W_{\text{out}} = 69,632 \times H_{\text{out}} \times W_{\text{out}}$ FLOPs.

The bottleneck block requires significantly fewer FLOPs because the costly 3×3 convolution is performed on a reduced number of channels (64 instead of 256), and the 1×1 convolutions have smaller kernels, making them computationally inexpensive.

3. Ability to combine the input:

(1) Spatially (within feature maps): The regular block, with two stacked 3×3 convolutions, has a larger effective receptive field and performs more spatial mixing operations, allowing better combination of spatial information within feature maps compared to the bottleneck block's single 3×3 convolution on reduced channels.

(2) Across feature maps: The bottleneck block is better here due to the 1×1 convolutions, which efficiently combine information across all input channels. The regular block's 3×3 convolutions also combine across channels but are less efficient for this purpose compared to dedicated 1×1 layers.

Question 2:¶

In this question we will understand another effect the skip connections shown in the ResNet diagram above have when training deep networks.

Consider the following toy model:

Let $M$ be a $m \times n$ matrix with small entries meaning $ \forall i,j ; |M_{i,j}| < 1 $. This is a realistic assumption - recall xavier weight initialization.

  1. Define $y_1 = M \cdot x_1$. Given $\frac{\partial L}{\partial y_1}$, derive $\frac{\partial L}{\partial x_1} $.

  2. Define $ y_2 = x_2 + M \cdot x_2 $. Given $ \frac{\partial L}{\partial y_2} $, derive $ \frac{\partial L}{\partial x_2} $.

  3. Based on your derivations above, explain why the residual form helps prevent vanishing gradients in deep networks consisting of many such layers in succession.

Show your work and explain your answer.

In [12]:
display_answer(hw2.answers.part4_q2)

1. Given $\frac{\partial L}{\partial y_1}$, derive $\frac{\partial L}{\partial x_1}$.

Since $y_1 = M x_1$, the Jacobian $\frac{\partial y_1}{\partial x_1} = M$.

Thus, $\frac{\partial L}{\partial x_1} = M^\top \frac{\partial L}{\partial y_1}$.

2. Given $\frac{\partial L}{\partial y_2}$, derive $\frac{\partial L}{\partial x_2}$.

Since $y_2 = x_2 + M x_2 = (I + M) x_2$, the Jacobian $\frac{\partial y_2}{\partial x_2} = I + M$.

Thus, $\frac{\partial L}{\partial x_2} = (I + M)^\top \frac{\partial L}{\partial y_2}$.

3. In deep networks with many layers, gradients are computed by multiplying Jacobians (or their transposes) along the chain rule.

For a non-residual layer ($y = M x$), the gradient flow is $\frac{\partial L}{\partial x} = M^\top \frac{\partial L}{\partial y}$. If $|M| < 1$ elementwise, repeated multiplication can cause gradients to vanish exponentially.

For residual layers ($y = x + M x$), the gradient is $(I + M)^\top \frac{\partial L}{\partial y}$. The identity term $I$ ensures that part of the gradient (at least 1) is preserved through each layer, preventing vanishing gradients even when $M$ has small entries.

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 5: Convolutional Architecture Experiments¶

In this part we will explore convolution networks and the effects of their architecture on accuracy. We'll use our deep CNN implementation and perform various experiments on it while varying the architecture. Then we'll implement our own custom architecture to see whether we can get high classification results on a large subset of CIFAR-10.

Training will be performed on GPU.

In [1]:
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf

%matplotlib inline
%load_ext autoreload
%autoreload 2
In [2]:
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()

Experimenting with model architectures¶

We will now perform a series of experiments that train various model configurations on a part of the CIFAR-10 dataset.

To perform the experiments, you'll need to use a machine with a GPU since training time might be too long otherwise.

Note about running on GPUs¶

Here's an example of running a forward pass on the GPU (assuming you're running this notebook on a GPU-enabled machine).

In [3]:
from hw2.cnn import ResNet

net = ResNet(
    in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
    pool_every=4, hidden_dims=[100]*2,
    pooling_type='avg', pooling_params=dict(kernel_size=2),
)
net = net.to(device)

test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_image = test_image.to(device)

test_out = net(test_image)

Notice how we called .to(device) on both the model and the input tensor. Here the device is a torch.device object that we created above. If an nvidia GPU is available on the machine you're running this on, the device will be 'cuda'. When you run .to(device) on a model, it recursively goes over all the model parameter tensors and copies their memory to the GPU. Similarly, calling .to(device) on the input image also copies it.

In order to train on a GPU, you need to make sure to move all your tensors to it. You'll get errors if you try to mix CPU and GPU tensors in a computation.

In [4]:
print(f'This notebook is running with device={device}')
print(f'The model parameter tensors are also on device={next(net.parameters()).device}')
print(f'The test image is also on device={test_image.device}')
print(f'The output is therefore also on device={test_out.device}')
This notebook is running with device=cuda
The model parameter tensors are also on device=cuda:0
The test image is also on device=cuda:0
The output is therefore also on device=cuda:0

Notes on using course servers¶

First, please read the course servers guide carefully.

To run the experiments on the course servers, you can use the py-sbatch.sh script directly to perform a single experiment run in batch mode (since it runs python once), or use the srun command to do a single run in interactive mode. For example, running a single run of experiment 1 interactively (after conda activate of course):

srun -c 2 --gres=gpu:1 --pty python -m hw2.experiments run-exp -n test -K 32 64 -L 2 -P 2 -H 100

To perform multiple runs in batch mode with sbatch (e.g. for running all the configurations of an experiments), you can create your own script based on py-sbatch.sh and invoke whatever commands you need within it.

Don't request more than 2 CPU cores and 1 GPU device for your runs. The code won't be able to utilize more than that anyway, so you'll see no performance gain if you do. It will only cause delays for other students using the servers.

General notes for running experiments¶

  • You can run the experiments on a different machine (e.g. the course servers) and copy the results (files) to the results folder on your local machine. This notebook will only display the results, not run the actual experiment code (except for a demo run).
  • It's important to give each experiment run a name as specified by the notebook instructions later on. Each run has a run_name parameter that will also be the base name of the results file which this notebook will expect to load.
  • You will implement the code to run the experiments in the hw2/experiments.py module. This module has a CLI parser so that you can invoke it as a script and pass in all the configuration parameters for a single experiment run.
  • You should use python -m hw2.experiments run-exp to run an experiment, and not python hw2/experiments.py run-exp, regardless of how/where you run it.

Experiment 1: Network depth and number of filters¶

In this part we will test some different architecture configurations based on our CNN and ResNet. Specifically, we want to try different depths and number of features to see the effects these parameters have on the model's performance.

To do this, we'll define two extra hyperparameters for our model, K (filters_per_layer) and L (layers_per_block).

  • K is a list, containing the number of filters we want to have in our conv layers.
  • L is the number of consecutive layers with the same number of filters to use.

For example, if K=[32, 64] and L=2 it means we want two conv layers with 32 filters followed by two conv layers with 64 filters. If we also use pool_every=3, the feature-extraction part of our model will be:

Conv(X,32)->ReLu->Conv(32,32)->ReLU->Conv(32,64)->ReLU->MaxPool->Conv(64,64)->ReLU

We'll try various values of the K and L parameters in combination and see how each architecture trains. All other hyperparameters are up to you, including the choice of the optimization algorithm, the learning rate, regularization and architecture hyperparams such as pool_every and hidden_dims. Note that you should select the pool_every parameter wisely per experiment so that you don't end up with zero-width feature maps.

You can try some short manual runs to determine some good values for the hyperparameters or implement cross-validation to do it. However, the dataset size you test on should be large. If you limit the number of batches, make sure to use at least 30000 training images and 5000 validation images.

The important thing is that you state what you used, how you decided on it, and explain your results based on that.

First we need to write some code to run the experiment.

TODO:

  1. Implement the cnn_experiment() function in the hw2/experiments.py module.
  2. If you haven't done so already, it would be an excellent idea to implement the early stopping feature of the Trainer class.

The following block tests that your implementation works. It's also meant to show you that each experiment run creates a result file containing the parameters to reproduce and the FitResult object for plotting.

In [5]:
from hw2.experiments import load_experiment, cnn_experiment
from cs236781.plot import plot_fit

# Test experiment1 implementation on a few data samples and with a small model
cnn_experiment(
    'test_run', seed=seed, bs_train=50, batches=10, epochs=10, early_stopping=5,
    filters_per_layer=[32,64], layers_per_block=1, pool_every=1, hidden_dims=[100],
    model_type='resnet',
)

# There should now be a file 'test_run.json' in your `results/` folder.
# We can use it to load the results of the experiment.
cfg, fit_res = load_experiment('results/test_run_L1_K32-64.json')
_, _ = plot_fit(fit_res, train_test_overlay=True)

# And `cfg` contains the exact parameters to reproduce it
print('experiment config: ', cfg)
Files already downloaded and verified
Files already downloaded and verified
--- EPOCH 1/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 2/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 3/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 4/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 5/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 6/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 7/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 8/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 9/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
--- EPOCH 10/10 ---
train_batch:   0%|          | 0/10 [00:00<?, ?it/s]
test_batch:   0%|          | 0/10 [00:00<?, ?it/s]
*** Output file ./results/test_run_L1_K32-64.json written
experiment config:  {'run_name': 'test_run', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'filters_per_layer': [32, 64], 'layers_per_block': 1, 'pool_every': 1, 'hidden_dims': [100], 'model_type': 'resnet', 'kw': {}}
No description has been provided for this image

We'll use the following function to load multiple experiment results and plot them together.

In [6]:
def plot_exp_results(filename_pattern, results_dir='results'):
    fig = None
    result_files = glob.glob(os.path.join(results_dir, filename_pattern))
    result_files.sort()
    if len(result_files) == 0:
        print(f'No results found for pattern {filename_pattern}.', file=sys.stderr)
        return
    for filepath in result_files:
        m = re.match('exp\d_(\d_)?(.*)\.json', os.path.basename(filepath))
        cfg, fit_res = load_experiment(filepath)
        fig, axes = plot_fit(fit_res, fig, legend=m[2],log_loss=True)
    del cfg['filters_per_layer']
    del cfg['layers_per_block']
    print('common config: ', cfg)

Experiment 1.1: Varying the network depth (L)¶

First, we'll test the effect of the network depth on training.

Configuratons:

  • K=32 fixed, with L=2,4,8,16 varying per run
  • K=64 fixed, with L=2,4,8,16 varying per run

So 8 different runs in total.

Naming runs: Each run should be named exp1_1_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_1_L2_K32.

TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.

In [7]:
plot_exp_results('exp1_1_L*_K32*.json')
common config:  {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'cnn', 'kw': {}}
No description has been provided for this image
In [8]:
plot_exp_results('exp1_1_L*_K64*.json')
common config:  {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'cnn', 'kw': {}}
No description has been provided for this image

Experiment 1.2: Varying the number of filters per layer (K)¶

Now we'll test the effect of the number of convolutional filters in each layer.

Configuratons:

  • L=2 fixed, with K=[32],[64],[128] varying per run.
  • L=4 fixed, with K=[32],[64],[128] varying per run.
  • L=8 fixed, with K=[32],[64],[128] varying per run.

So 12 different runs in total. To clarify, each run K takes the value of a list with a single element.

Naming runs: Each run should be named exp1_2_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_2_L2_K32.

TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.

In [9]:
plot_exp_results('exp1_2_L2*.json')
common config:  {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'cnn', 'kw': {}}
No description has been provided for this image
In [10]:
plot_exp_results('exp1_2_L4*.json')
common config:  {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'cnn', 'kw': {}}
No description has been provided for this image
In [11]:
plot_exp_results('exp1_2_L8*.json')
common config:  {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'cnn', 'kw': {}}
No description has been provided for this image

Experiment 1.3: Varying both the number of filters (K) and network depth (L)¶

Now we'll test the effect of the number of convolutional filters in each layer.

Configuratons:

  • K=[64, 128] fixed with L=2,3,4 varying per run.

So 4 different runs in total. To clarify, each run K takes the value of an array with a three elements.

Naming runs: Each run should be named exp1_3_L{}_K{}-{}-{} where the braces are placeholders for the values. For example, the first run should be named exp1_3_L2_K64-128-256.

TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.

In [12]:
plot_exp_results('exp1_3*.json')
common config:  {'run_name': 'exp1_3', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'cnn', 'kw': {}}
No description has been provided for this image

Experiment 1.4: Adding depth with Residual Networks¶

Now we'll test the effect of skip connections on the training and performance.

Configuratons:

  • K=[32] fixed with L=8,16,32 varying per run.
  • K=[64, 128, 256] fixed with L=2,4,8 varying per run.

So 6 different runs in total.

Naming runs: Each run should be named exp1_4_L{}_K{}-{}-{} where the braces are placeholders for the values.

TODO: Run the experiment on the above configuration with the ResNet model. Make sure the result file names are as expected. Use the following blocks to display the results.

In [13]:
plot_exp_results('exp1_4_L*_K32.json')
common config:  {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 3, 'hidden_dims': [512], 'model_type': 'resnet', 'kw': {}}
No description has been provided for this image
In [14]:
plot_exp_results('exp1_4_L*_K64*.json')
common config:  {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 50, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 6, 'hidden_dims': [512], 'model_type': 'resnet', 'kw': {}}
No description has been provided for this image

In this part you will create your own custom network architecture based on the CNN you've implemented.

Try to overcome some of the limitations your experiment 1 results, using what you learned in the course.

You are free to add whatever you like to the model, for instance

  • Batch normalization
  • Dropout layers
  • Skip connections, bottlenecks
  • Change kernel spatial sizes and strides
  • Custom blocks or ideas from known architectures (e.g. inception module)

Just make sure to keep the model's init API identical (or maybe just add parameters).

TODO: Implement your custom architecture in the YourCNN class within the hw2/cnn.py module.

In [15]:
from hw2.cnn import YourCNN

net = YourCNN((3,100,100), 10, channels=[32]*4, pool_every=2, hidden_dims=[100]*2)
print(net)

test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_out = net(test_image)
print('out =', test_out)
YourCNN(
  (feature_extractor): Sequential(
    (0): ResidualBlock(
      (main_path): Sequential(
        (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Dropout2d(p=0.2, inplace=False)
        (2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): LeakyReLU(negative_slope=0.01)
        (4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Conv2d(3, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
    )
    (1): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
    (2): ResidualBlock(
      (main_path): Sequential(
        (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Dropout2d(p=0.2, inplace=False)
        (2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (3): LeakyReLU(negative_slope=0.01)
        (4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (shortcut_path): Sequential(
        (0): Identity()
      )
    )
    (3): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  )
  (mlp): MLP(
    (hidden_layers): ModuleList(
      (0): Linear(in_features=3872, out_features=100, bias=True)
      (1): Linear(in_features=100, out_features=100, bias=True)
      (2): Linear(in_features=100, out_features=10, bias=True)
    )
    (activations): ModuleList(
      (0): LeakyReLU(negative_slope=0.01)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Identity()
    )
  )
)
out = tensor([[ -9.2310, -11.0149,  -6.9349,  13.4889,   7.3927,   8.8633,  10.5746,
         -10.7626, -25.2632,   5.0491]], grad_fn=<AddmmBackward0>)

Experiment 2 Configuration¶

Run your custom model on at least the following:

Configuratons:

  • K=[32, 64, 128] fixed with L=3,6,9,12 varying per run.

So 4 different runs in total. To clarify, each run K takes the value of an array with a three elements.

If you want, you can add some extra runs following the same pattern. Try to see how deep a model you can train.

Naming runs: Each run should be named exp2_L{}_K{}-{}-{}-{} where the braces are placeholders for the values. For example, the first run should be named exp2_L3_K32-64-128.

TODO: Run the experiment on the above configuration with the YourCNN model. Make sure the result file names are as expected. Use the following blocks to display the results.

In [16]:
plot_exp_results('exp2*.json')
common config:  {'run_name': 'exp2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 400, 'epochs': 30, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 9, 'hidden_dims': [256], 'model_type': 'yours', 'kw': {}}
No description has been provided for this image

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.

In [17]:
from cs236781.answers import display_answer
import hw2.answers

Question 1¶

Analyze your results from experiment 1.1. In particular,

  1. Explain the effect of depth on the accuracy. What depth produces the best results and why do you think that's the case?
  2. Were there values of L for which the network wasn't trainable? what causes this? Suggest two things which may be done to resolve it at least partially.
In [18]:
display_answer(hw2.answers.part5_q1)

1. Analysis of the Effect of Depth ($L$)¶

In this experiment, we observed that increasing the depth of a plain CNN (without skip connections) does not linearly improve performance.

  • Optimal Depth: Typically, the best results are achieved with value of $L=4$. At these depth, the network has enough capacity to learn meaningful hierarchical features from the CIFAR-10 dataset while remaining shallow enough for the gradient to flow effectively from the output back to the input layers.
  • Performance Degradation: As $L$ increases to 8 and 16, the network's accuracy begins to drop. Even though a deeper model has more parameters and theoretically higher representational capacity, the difficulty of optimizing a deep "plain" stack outweighs the benefits of the added layers.

2. Non-trainable values of $L$ and the Vanishing Gradient¶

We observed that for higher values of $L$ (specifically $L=8$ and $L=16$), the network often became untrainable. This was characterized by flat-line accuracy (near 10%) and loss curves that did not decrease, eventually triggering early stopping.

The Cause: Vanishing Gradients In a plain CNN, the gradient is computed using the chain rule. During backpropagation, the error signal is multiplied by the weights and the derivatives of the activation functions at every layer. If intermediate derivatives are small (which is common with standard initialization), multiplying them 8 or 16 times causes the gradient to decay exponentially until it vanishes. The early layers receive no update, so the model never learns to extract basic features.

Suggested Resolutions:

  1. Residual Connections (Skip Connections): Introduce "shortcuts" that allow the gradient to bypass layers. By changing the layer to $y = F(x) + x$, the gradient can flow through the identity path ($+1$ in the derivative), ensuring it stays strong even at $L=16$.

  2. Batch Normalization: Adding Batchnorm layers after convolutions helps keep the activations in a range where the derivatives (e.g., of a ReLU or Tanh) are less likely to vanish or saturate, stabilizing the distribution of inputs to deeper layers and allowing for higher learning rates.

Question 2¶

Analyze your results from experiment 1.2. In particular, compare to the results of experiment 1.1.

In [19]:
display_answer(hw2.answers.part5_q2)

Analysis of Experiment 1.2: Filter Width ($K$) vs. Depth ($L$)¶

1. Case $L=2$:¶

  • The network is shallow enough to converge across all tested filter widths $K$, but we observe that **smaller filter counts ($K=32$) generalize better in this configuration.
  • U-Shape Loss: The test loss exhibits a distinct U-shape, reaching a minimum before rising. This indicates that as training continues, the added complexity of 64 or 128 filters introduces more parameters than the architecture can effectively regularize, causing the model to move past the point of optimal generalization.
  • Overfitting: Significant overfitting is evident as training accuracy approaches $90\%$ while test accuracy plateaus much lower, creating a visible generalization gap.

2. Case $L=4$:¶

At $L=4$, the relationship between width and performance flips; the network now requires more filters to reach its peak potential, though overfitting becomes even more pronounced.

  • $K=64$ and $K=128$ outperformed $K=32$ in test accuracy.
  • With 4 layers, the model has enough architectural depth to benefit from a higher number of features ($K=64, 128$). However, without regularization, the model uses its increased capacity to memorize the training set rather than generalizing.
  • U-Shape Loss: A very sharp U-shape is visible in the test loss for all $K$ values. The loss drops quickly but rebounds aggressively after approximately iteration 7. This confirms that higher capacity leads to faster divergence once the training set is memorized.
  • Overfitting: We observe extreme overfitting across all values of $K$. For $K=128$, training accuracy reaches nearly $100\%$, yet test accuracy remains around $70\%$, resulting in a massive generalization gap.

3. Case $L=8$:¶

  • All configurations ($K=32, 64, 128$) remained stuck at $\approx 10\%$ accuracy.
  • This confirms that width is irrelevant when depth creates an optimization barrier. In a plain CNN architecture without skip connections or normalization, the vanishing gradient problem prevents any learning from occurring at $L=8$, rendering the number of filters $K$ moot.

Comparison to Experiment 1.1¶

Comparing these results to Experiment 1.1 highlights how width and depth interact:

  1. Depth is the primary constraint: Just as in Experiment 1.1, increasing depth beyond a certain point ($L=8$) leads to total failure that no amount of width ($K$) can fix.
  2. Experiment 1.2 shows that the "best" $K$ depends on depth—$K=32$ was best for $L=2$, but $K=64$ was required to maximize the potential of $L=4$.
  3. While Experiment 1.1 focused on the depth-limit, Experiment 1.2 shows that even at stable depths, adding width ($K=128$) or depth ($L=4$) accelerates overfitting, leading to high training accuracy but stagnant or degrading test performance due to the lack of regularization.

Question 3¶

Analyze your results from experiment 1.3.

In [20]:
display_answer(hw2.answers.part5_q3)

Analysis of Experiment 1.3: Multi-Stage Filter Width ($K=[64, 128]$) vs. Depth ($L$)¶

1. Case $L=2$ and $L=3$:¶

Both the $L=2$ and $L=3$ configurations successfully converged, as the network depth was shallow enough for the gradient to propagate effectively despite the increased width.

  • Both models achieved high performance, with $L=2$ performing slightly better, peaking at approximately $73-74\%$ test accuracy, while $L=3$ reached approximately $69-70\%$.
  • Both configurations exhibit significant overfitting, characterized by a large gap between training accuracy (approaching $100\%$) and test accuracy.
  • A clear U-shape is visible in the test loss for both runs; the loss reaches a minimum before rebounding upward.
  • This confirms that as the model exhausts its ability to generalize, it uses its high filter capacity to memorize the training set, causing the test performance to degrade. While $L=3$ provides more parameters, the optimization difficulty of the extra layer results in slightly lower accuracy than the $L=2$ baseline.

2. Case $L=4$:¶

At $L=4$, the multi-stage architecture suffers a total training failure, highlighting the critical trade-off between filter width and network depth.

  • Both training and test accuracy flatline at exactly $10\%$, and the training loss remains stagnant near $2.30$.
  • This result demonstrates that increasing the width incrementally ($64 \to 128$) cannot compensate for excessive depth in a plain CNN architecture.
  • Even though $L=4$ was trainable with narrower filters in previous runs, the added complexity of wider filters at this depth triggers an earlier optimization collapse. The gradients fail to propagate through the (K*L)=8-layer wide stack.

Question 4¶

Analyze your results from experiment 1.4. Compare to experiment 1.1 and 1.3.

In [21]:
display_answer(hw2.answers.part5_q4)

Analysis of Experiment 1.4: The Impact of Skip Connections¶

In this experiment, we introduced skip connections (Residual connections) to resolve the optimization bottlenecks identified in previous experiments. These results clearly demonstrate that skip connections are the primary driver for training deeper networks effectively.

The most significant result is the successful training of deep architectures that previously suffered from total optimization collapse in Experiments 1.1 and 1.3.

  • Overcoming Vanishing Gradients: Unlike plain CNNs, where $L \ge 8$ resulted in stagnant $10\%$ accuracy, the Residual networks with $L=8, 16, \text{and } 32$ converged successfully.
  • Identity Mapping: Skip connections allow the gradient to bypass weight layers through the identity path $H(x) = F(x) + x$. This ensures that the error signal remains strong enough to update early layers even at extreme depths like $L=32$.

1. Case $K=32$ ($L=8, 16, 32$):¶

  • The Benefit of Depth: With skip connections, we see that $L=16$ (blue) achieves high test accuracy ($\approx 77\%$), outperforming the best results from Experiment 1.1.
  • Diminishing Returns: Interestingly, the $L=32$ model (orange) performs worse than $L=16$, plateauing around $68\%$ test accuracy. This suggests that while skip connections solve trainability, very deep plain-residual stacks may still struggle with generalization or require additional stabilization like Batch Normalization to leverage their full capacity.

2. Case $K=[64, 128, 256]$ ($L=2, 4, 8$):¶

This wide, multi-stage architecture achieved the best performance across all experiments.

  • The $L=4$ variant reached a peak test accuracy of $\approx 80\%$, proving that combining skip connections with increased width ($K$) provides the most robust architecture for CIFAR-10.
  • For $L=2$ and $L=4$, we observe a very sharp U-shape in the test loss. Because these models reach near $100\%$ training accuracy quickly, they begin to memorize the dataset, causing test loss to rebound aggressively after the initial drop.

Comparison to Previous Experiments¶

  1. Comparison to Experiment 1.1: In Experiment 1.1, depth was a liability beyond $L=4$. In Experiment 1.4, depth becomes an asset up to $L=16$, thanks to the stabilization provided by skip connections.
  2. Comparison to Experiment 1.3: Experiment 1.3 showed that increasing width made deep networks even more fragile, with failure occurring at $L=4$. In contrast, the Residual version of the wide architecture ($L=8, K=[64, 128, 256]$) trains perfectly and outperforms all plain models.
  3. Generalization Gap: Across all successful runs, the generalization gap remains the biggest challenge. While skip connections solved the trainability wall, the models still require early stopping or better regularization to manage the memorization of training noise.
In [22]:
display_answer(hw2.answers.part5_q5)

Question 5: Architecture Analysis of YourCNN and Experiment 2¶

1. Architectural Enhancements in YourCNN¶

To improve training stability and performance relative to the initial "plain" CNN models, we introduced several key architectural changes in the YourCNN class:

  • Residual (Skip) Connections: We utilized the ResidualBlock to implement skip connections ($H(x) = F(x) + x$). This allows gradients to bypass the weight layers through an identity path, fundamentally solving the vanishing gradient problem and allowing us to train deeper stacks that previously failed.
  • Integrated Batch Normalization: We enabled batchnorm=True within each block to normalize feature map statistics. This stabilizes the distribution of inputs to deeper layers, prevents activation drifting, and allows for more robust optimization.
  • Strategic Dropout (0.2): We applied a dropout rate of $0.2$ to provide regularization without starving the deep layers of signal. While higher dropout (0.5) was initially tested, the $0.2$ rate provided the necessary balance to prevent neuron co-adaptation while maintaining information flow in deep architectures.
  • Leaky ReLU Activation: We replaced standard ReLU with Leaky ReLU (slope=$0.01$). This ensures a small gradient flow for negative inputs, mitigating the "dying ReLU" problem and preserving signal during the backward pass.

2. Analysis of Experiment 2 Results¶

We evaluated this architecture using fixed pyramidal filters ($K=[32, 64, 128]$) across varying depths $L=3, 6, 9, 12$.

  • Successful Convergence at Extreme Depths: The introduction of residual connections and BatchNorm allowed all configurations to train effectively. Unlike previous experiments where $L=8$ or $L=12$ failed, these models show consistently decreasing training loss and increasing accuracy.
  • Performance Peak at $L=3$: The shallowest configuration ($L=3$, orange) achieved the highest overall performance, peaking at over $80\%$ test accuracy. As depth increased, we observed a steady decline in test performance, with $L=12$ (blue) reaching approximately $68\%$.
  • Generalization and Overfitting: All models reached high training accuracy (between $72\%$ and $90\%$), but the gap between training and test accuracy widened with depth. The test loss converge after 17-20 epochs from that point test accuracy the same, but the traning accuracy improved which indicate start of overfitting.

3. Comparison to Experiment 1.4 (Residual CNNs)¶

The YourCNN architecture is very similar to the Residual CNNs tested in Experiment 1.4, as both rely on skip connections to enable deep learning. However, this experiment places a stronger emphasis on generalization through the following improvements:

  • Improved Baseline: The $L=3$ configuration in YourCNN reached $\approx 81\%$ test accuracy, outperforming the best $K=32$ results from Experiment 1.4 ($\approx 78\%$). This improvement is largely attributed to the addition of Dropout, Batch Normalization and Leaky ReLU, which provide better generalization and internal stability than the basic residual structure alone.
  • Balanced Regularization: By making the dropout rate to $0.2$, YourCNN maintains a more consistent signal than the models in Experiment 1.4.
  • Experiment 1.4 proved that skip connections fix trainability, but YourCNN demonstrates that combining them with Dropout, BatchNorm and Leaky ReLU is necessary to improve generalization on the test set.

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$

Part 6: YOLO - Objects Detection¶

In this part we will use an object detection architecture called YOLO (You only look once) to detect objects in images. We'll use an already trained model weights (v5) found here: https://github.com/ultralytics/yolov5

In [1]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Load the YOLO model
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
model.to(device)
# Images
img1 = 'imgs/DolphinsInTheSky.jpg'  
img2 = 'imgs/cat-shiba-inu-2.jpg' 
cuda
Using cache found in /home/boaz.maron/.cache/torch/hub/ultralytics_yolov5_master
requirements: Ultralytics requirements ['gitpython>=3.1.30', 'urllib3>=2.5.0 ; python_version > "3.8"'] not found, attempting AutoUpdate...
WARNING ⚠️ Retry 1/2 failed: Command 'pip install --no-cache-dir "gitpython>=3.1.30" "urllib3>=2.5.0 ; python_version > "3.8"" ' returned non-zero exit status 2.
WARNING ⚠️ Retry 2/2 failed: Command 'pip install --no-cache-dir "gitpython>=3.1.30" "urllib3>=2.5.0 ; python_version > "3.8"" ' returned non-zero exit status 2.
WARNING ⚠️ requirements: ❌ Command 'pip install --no-cache-dir "gitpython>=3.1.30" "urllib3>=2.5.0 ; python_version > "3.8"" ' returned non-zero exit status 2.
YOLOv5 🚀 2025-12-25 Python-3.8.12 torch-1.10.1 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11012MiB)

Fusing layers... 
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
Adding AutoShape... 

Inference with YOLO¶

You are provided with 2 images (img1 and img2). TODO:

  1. Detect objects using the YOLOv5 model for these 2 images.
  2. Print the inference output with bounding boxes.
  3. Look at the inference results and answer the question below.
In [2]:
with torch.no_grad():
    #Detecting objects using YOLOv5
    objs_img1 = model(img1)
    objs_img2 = model(img2)

    #Printing inference output1 with bounding boxes
    print('inference output img 1:')
    print(objs_img1)
    print(objs_img1.pandas().xyxy[0])
    objs_img1.show()

    #Printing inference output2 with bounding boxes
    print('inference output img 2:')
    print(objs_img2)
    print(objs_img2.pandas().xyxy[0])
    objs_img2.show()
inference output img 1:
image 1/1: 183x275 2 persons, 1 surfboard
Speed: 16.9ms pre-process, 25.6ms inference, 2.9ms NMS per image at shape (1, 3, 448, 640)
         xmin       ymin        xmax        ymax  confidence  class       name
0  100.278389  47.370995  187.863541  118.462646    0.903491      0     person
1   22.531378  20.887983  128.905197   92.257042    0.500974      0     person
2   85.618011  98.014702  139.018326  124.627907    0.367016     37  surfboard
No description has been provided for this image
inference output img 2:
image 1/1: 750x750 2 cats, 1 dog
Speed: 22.1ms pre-process, 26.0ms inference, 1.8ms NMS per image at shape (1, 3, 640, 640)
         xmin        ymin        xmax        ymax  confidence  class name
0   11.568510  115.774994  312.757050  667.323608    0.655977     15  cat
1  363.715546  290.178497  750.000000  721.983887    0.509417     16  dog
2  311.190796  102.437439  595.031738  687.870728    0.391702     15  cat
No description has been provided for this image

Question 1¶

Analyze the inference results of the 2 images.

  1. How well did the model detect the objects in the pictures? with what confidance?
  2. What can possibly be the reason for the model failures? suggest methods to resolve that issue.
  3. recall that we learned how to fool a model by adverserial attack (PGD), describe how you would attack an Object Detection model (such as YOLO).
In [3]:
#====
from cs236781.answers import display_answer
import hw2.answers
#====

display_answer(hw2.answers.part6_q1)

Your answer:

  1. The model detection of the objects is bad.

    In both images, the bounding boxes the model produces where not very good. In the first image, he interpreted only a part of the dolphin as an object - not all of the dolphin. In the second image, he didnt produce a bounding box for the real cat.

    The model failed to classify correctly on almost all of the bounding boxes(failed on +80% of the boxes). Moreover, he got high confidence in some of these false classification. For example, in the first picture he classify the dolphins as humans with confidence of 0.9. Which make it hard to rely on such a model.

  2. There are several possible reasons for the model failures, one of them is:

    Domain shift. The model might have trained on clean images of dolphins, and not images on dolphins that are facing the sun. Or, it might be even that the training data did not had a dolphin class at all. Also, it might have trained on dogs that have more "dogs characteristics" then the provided Shiba Inu dogs(which are a little similar to cats). Therefore, because both of the photos are a little unique, it might be that the problems in classifications is duo to a domain shift.

    To resolve this, we should perform fine-tuning. We can train the model with a richer dataset that would include pictures of doplhins facing the sun, Shiba Inu dogs, and more unique pictures.

    Another possible way to resolve this issue is data augmantation. For example, in the dolphins picture, we might be able to increase lighting such that we could see the skin color of the dolphins. Then, the model might identify the dolphins as dolphins and not as humans.

  3. A PGD attack on an Object Detection model would itertively insert noise to input images in order to minimize performances. This includes minimzing the right class classification confidance, and bounding box selection. This will lead to missed detections or confidence misclassifications.

Creative Detection Failures¶

Object detection pitfalls could be, for example: occlusion - when the objects are partially occlude, and thus missing important features, model bias - when a model learn some bias about an object, it could recognize it as something else in a different setup, and many others like Deformation, Illumination conditions, Cluttered or textured background and blurring due to moving objects.

TODO: Take pictures and that demonstrates 3 of the above object detection pitfalls, run inference and analyze the results.

In [4]:
# You can change the images names in the code below, and add new ones

images = [
    "imgs/picture_01.jpg",
    "imgs/picture_02.png",
    "imgs/picture_03.png"
]

with torch.no_grad():
    for img in images:
        objs_img = model(img)
        print(objs_img)
        print(objs_img.pandas().xyxy[0])
        objs_img.show()
image 1/1: 2304x1856 1 apple
Speed: 89.1ms pre-process, 20.5ms inference, 1.6ms NMS per image at shape (1, 3, 640, 544)
          xmin       ymin         xmax         ymax  confidence  class   name
0  1091.855713  1717.4198  1357.054688  2013.912598    0.473363     47  apple
No description has been provided for this image
image 1/1: 1480x1369 1 bear, 1 sports ball
Speed: 139.2ms pre-process, 20.5ms inference, 1.4ms NMS per image at shape (1, 3, 640, 608)
         xmin       ymin         xmax         ymax  confidence  class  \
0  993.371948  27.739573  1206.332275   175.821182    0.428992     32   
1   65.861862  22.638054  1321.950562  1421.806396    0.288498     21   

          name  
0  sports ball  
1         bear  
No description has been provided for this image
image 1/1: 1350x1080 (no detections)
Speed: 79.0ms pre-process, 20.4ms inference, 0.9ms NMS per image at shape (1, 3, 640, 512)
Empty DataFrame
Columns: [xmin, ymin, xmax, ymax, confidence, class, name]
Index: []
No description has been provided for this image

Question 3¶

Analyize the results of the inference.

  1. How well did the model detect the objects in the pictures? explain.
In [5]:
display_answer(hw2.answers.part6_q3)

Your answer:

The model failed to detect correctly in all of the 3 attached pictures. We will explain the main Object Detection pitfalls the model experienced with each picture.

Picture 1: Occlusion. As we can see, the cat is sitting behind many plants that partially blocks it. The model failed therefore to classify the cat correcly, and wrongly detects an apple in the image.

Picture 2: Model Bias. The cat is not in its casual setting. It is underwater, and the light is bluish. The model was likely trained on images of cats that are not near water, so this represent a domain shift. As a resualt, in the given settings, the model is having problems classifying the cat correctly. Note that a lot of bears do live near water resources. Therefore it is possible that the trained data contained many pictures of bears with similar characteristics to this pictures(blue tint, watery textures and more..). This may explain why the model associated the features with the bear class.

Picture 3: Illumination conditions. The cat picture now is overexposed. It created two problems for the model:

  1. Texture loss, as the picture is a little blurred and many of the pixels turned into pure white.
  2. The model is probably not trained on overexposed picture. Therefore this picture is out of the original data distribution.

Bonus¶

Try improving the model performance over poorly recognized images by changing them. Describe the manipulations you did to the pictures.

In [6]:
import cv2

#insert bonus code here
In [7]:
display_answer(hw2.answers.part6_bonus)

Your answer:

Write your answer using markdown and $\LaTeX$:

# A code block
a = 2

An equation: $e^{i\pi} -1 = 0$

In [ ]: